PREFACE



Dear readers, 



Although it is well-known that confidentiality, integrity and availability are
high-level objectives of information security, much of the attention in the security arena
has been devoted to the confidentiality and availability aspects of security. IFIP 
TC-11 Working Group 11.5 has been charged with exploring the area of the 
integrity objective within information security and the relationship between 
integrity in information systems and the overall internal control systems that are 
established in organizations to support the corporate governance codes. 



In this collection you will find the papers that were presented during the
second working conference dedicated to the subject. Information about
IFIP TC-11 and its working groups is also included.



The second working conference of Working Group 11.5 continues the ongoing dialog
between the information security specialists and the internal control specialists so
that both may work more effectively together to assist in creating effective business
systems in the future. The goals for this and following conferences are to find
answers to the following questions:

• what precisely do business managers need in order to have confidence in the 
integrity of their information systems and their data; 

• what is the status quo of research and development in this area; 

• where are the gaps between business needs on the one hand and research and 
development on the other and what needs to be done to bridge these gaps. 

The results of the working conference, both in the papers presented and the 
outcome of the panel sessions, will be the basis for the future direction of the 
activities of the working group. The cooperation with other organizations that have 
an interest in this area will be further expanded in the forthcoming years. 








If you missed the chance to explore the field of integrity and internal control
in information systems this year, take the opportunity next year to join the
debate with colleagues and further the development of reliable information systems:
submit a paper or participate in the working conference.

We would like to thank all individuals and organizations that have made it possible 
for this working conference to take place and all the authors of the papers 
submitted to the working conference. 
Abstract

Data integrity policies often require that quality and integrity metadata be 
generated and communicated to potential users. However, in data warehouses, 
federations, and other multi-tier databases, administrators at different tiers use 
different schemas. Data at the upper tier is derived from the lower, so we consider 
the upper tier’s tables to be views, derived by SQL-like expressions. 

Unfortunately, an assertion about some granule in the sources (a table, column, or 
cell) is often meaningless to view users, and vice versa. An understanding of the 
SQL view gives intuitive guidance for propagating such metadata, but not explicit 
semantics. 

It appears feasible to create a system that drastically reduces the skill and labor 
required for propagating metadata and events between the tiers. We show many 
examples where, based on the view query and the metadata on the relevant sources, 
one can automatically generate useful propagation rules. Propagation downward 
from views to sources is also handled. Our approach is to automate the easy cases 
(which we expect to be quite common), and to assist on harder cases. Knowledge 
of specific metadata types or query operators can be supplied incrementally. If the 
view’s query expression is difficult, one may compose the propagation rules from 
its constituent operators. 



Keywords 

derived data, view, metadata, multi-tier, integrity 
1 INTRODUCTION



A data integrity strategy is likely to involve large amounts of metadata (e.g., 
quality measures, constraints) plus operations and events (e.g., corrections, error 
messages). This information helps users employ the data correctly, and helps 
managers plan data quality improvements. Our work explores techniques for 
coordinating and propagating such information in a multi-tier database. 

A multi-tier database is one that provides several different virtual or physical 
databases, each one derived from the one below. The phenomenon takes many 
forms. For example, a set of view tables can be used to insulate applications from 
stored tables that have been partitioned or denormalized, and which change as the 
workload changes. A federated database provides a virtual schema above multiple 
sources. A data warehouse gathers and transforms data and stores it in a separate 
server; this can be seen as computing a materialized view (subject to delays in 
propagating source updates). Web-oriented distributed systems often have tiers of 
objects derived from each other. 

Because the data in the tiers of a multi-tier database are related, it is important that 
the tiers maintain full and consistent integrity information. However, integrity 
metadata specified at a tier is usually local to that tier and is not propagated to 
other tiers. The reason is that users and administrators lack the time, the motivation,
the skill in understanding data derivations (e.g., SQL), or the business knowledge
needed to propagate each item of new or changed metadata to all interested parties.
One cannot ask administrators to write a separate piece of code for each metadata 
type on each attribute, let alone for each row or cell that has metadata attached. 

The goal of our research is to extend the ability of database systems to support tiers 
as views. Users at any tier should have the illusion that their database is
single-tier, though perhaps having multiple administrators. In such a system, all relevant
metadata would be propagated to each tier, and made accessible in terms of the 
schema at the tier. A simplified picture of the system appears below in Figure 1. 

The metadata-propagation problem is comparable to the well-known problem of 
view update semantics (Keller, 1986) that has daunted database theorists. 
Consequently, we acknowledge that a fully automatic solution is not possible in all 
cases. Our intention is therefore to automate the easy cases, and provide 
automated assistance for the hard ones. 

There are numerous types of metadata, many of which are domain-specific. 
Consequently, the kinds of metadata (and the options for how they will be treated) 
cannot be provided as a turnkey system. Instead, the system should permit 
extensions by tool vendors, customer organizations, and even business-oriented 
data administrators. To this end, we propose a framework that allows semantic 
choices to be supplied as small chunks of knowledge, rather than modifications to 
a query processor. 
Figure 1: System Overview



This paper is organized as follows. Section 2 discusses the need for integrity 
information in multi-tier databases, and how this information can be used 
effectively across tiers. Section 3 elaborates our proposed framework, and 
provides examples illustrating its components and its use. Section 4 provides 
conclusions, plans for future work, and some open issues in this area. Some 
preliminary results from this paper appeared in (Rosenthal, 1997). 

2 DATA INTEGRITY AND METADATA 

2.1 Data integrity requirements of large databases 

Larger, more complex databases tend to require greater attention to data integrity 
and other controls, beyond the mechanisms used in simpler systems. Multi-tier
databases, especially those that integrate data from many sources, are no exception.
For example, the data warehouse literature reports that data integration and “data
scrubbing” consume as much as 60-80% of the warehousing effort (Inman, 1996;
Robinson, 1996).
Some of the reasons apply to any large system, regardless of architecture:

• Manual checking cannot handle the volume either of existing data or of new 
arrivals. 

• Users’ access to databases was traditionally mediated by applications, which 
often included integrity protections and limited the data returned (thereby also 
enhancing security). Now easy-to-use ad-hoc query interfaces make it feasible 
for many users to bypass this mediation. 

• User bases are growing. A larger user base justifies greater expenditures. 

• Finally, the data and tool sets are valuable, and should be made available to a 
wide span of users. Yet sensitive information will be withheld unless it can be 
protected. The larger scale of data and users (potentially thousands of data 
attributes and of users) makes security administration and enforcement serious 
problems. 

Some other factors have been observed (by us and others) to apply especially to 
mechanisms that provide integrated views across multiple sources: 

• Errors that went unnoticed when data was separate become painfully apparent 
when conflicting data is brought together. Improved data quality tests (e.g., 
consistency checks) may create a perception of decreased quality. 

• Users at view tiers are often less intimately familiar with the underlying source 
data, and hence less able to compensate for faults. 

• Warehouse users often use summary data (e.g., totals, averages), which may 
hide the underlying errors. 

• Data that was appropriate for its original purpose may not suit new goals, e.g., 
matching against other data sources. The variations among the sources’ 
attitudes, policies, and practices contribute to uneven quality. 

• In an integrated database, the source that gathers certain data may not be a user 
of that data. So there is no natural internal feedback to ensure quality. 

2.2 Uses of Propagated Metadata, Events, and Operations 

Metadata is stored in the database, and can be queried using standard data query 
languages (such as SQL). The result of a query could include table references 
(“Which tables contain data supplied by Dow Jones?”), attribute references (“What 
attributes in the SUPPLIERS table have unreliable data?”), or data (“Which 
records in EMPLOYEE were last updated by user ‘billgates’?”). This subsection 
discusses the ways in which users will want to access different kinds of integrity 
metadata. 

Data Quality 

Data quality metadata might describe a granule’s sources and processing history, 
its credibility, its error bounds (absolute or relative), and its availability (if 
connectivity is intermittent). Hundreds of other potentially useful types have been 
identified (Wand, 1996), so extensibility is essential. 
End users and developers who work with view tables can use data quality metadata
to select data to use (i.e., via query predicates that test quality information) and to 
interpret the data they do use. In both cases, they want descriptions in terms of 
their familiar views, not of source tables. Hence we need to propagate quality 
information upward. 

Managers can use quality metadata to guide quality improvement, by comparing 
(in their native view) data quality with the business requirements for quality. When 
the existing quality falls short, the desired quality on view attributes needs to be 
propagated down to the source tables. That is, we want a “complaint” facility that 
propagates the complaints down to the correct source metadata. If other users 
should be made aware of the doubts (e.g., for scientific data or operational 
planning), shared complaints expressed against source data might then be 
propagated upward to other views. 

Integrity Constraints 

Today’s databases have many constraints defined on source tables. View users 
need to see them as constraints on view tables. Also, once a predicate is visible at 
both source and view tiers, it can be enforced at either place. Enforcing in clients 
permits faster notification of errors, and permits data entry when disconnected 
from the source databases. 

A radical alternative is to say that constraint administration should be done mainly 
at the view level, not at the source level. After all, business domain experts are 
more knowledgeable about constraints than the technicians who design source 
tables. Each expert might define constraints for a portion of the database,
expressed in terms of user views (e.g., the portion that they are primarily
responsible for populating, such as transport or finance). Propagating these
constraints down to source tables allows them to be enforced for all updates, not 
just updates through this view, and to use indexes maintained on the source tables. 
Also, constraints expressed on the source tables can propagate upward to other 
views. 

For downward propagation, constraint predicates on view tables can be seen as 
queries, so query processors can re-express them on the source tables. However, 
this translation may not identify constructs (e.g., key constraints) that are 
meaningful to users or that have efficient enforcement techniques. Also, one needs 
to ask whether the constraint should apply to all updates, or only those received 
through this view. 

Upward propagation is more complex. Ideally, one could devise a set of view 
constraints equivalent to those on the sources. In practice, while some source 
constraints will be expressible as constraints on data in the view tables, others will 
be irrelevant to updates provided through that view schema, and still others will be 
relevant but will require data not visible in the view schema (as discussed under 
the next item). A reasonable compromise would be to propagate source constraints 
upward to views wherever possible, and to indicate that further constraints are also 
enforced. 
Error Messages

Trouble ensues when a view user’s update violates an integrity constraint that is 
enforced on source tier data. An error message should describe the violated 
constraint, but a message stated in foreign (source) terms may confuse and even 
anger view users. To the extent possible, the violation should be explained in 
terms of view tables, even if detected at the source. 

This translation is easily automated for simple views, and when the constraint is 
expressible on information supplied with the user’s update. However, some 
constraint violations involve data outside the user’s view. This raises user interface 
issues; perhaps one should describe the error both in hybrid terms and purely in 
source terms. The system should also check whether the view user was authorized 
to read the additional relevant data from the source; if not, there are difficult 
tradeoffs between integrity and confidentiality (Jajodia, 1995). 



3 A SYSTEM FRAMEWORK 

The previous section demonstrated the value of propagating metadata between 
different tiers in a database system. This section describes a framework (that is,
standards, services, and a repository of meta-metadata) that semi-automatically 
produces propagation rules, and enables new knowledge about propagation to be 
added incrementally and easily. 

To understand the scope of this framework, suppose for a moment that a table at a 
view tier has just been defined in terms of some source tables. The view definition 
language (e.g. SQL) does not specify any metadata for the view. The framework 
therefore must do the following: 

• Determine the types of metadata that should be in the view. 

• Select “upward rules” (or provide the view creator with a choice of rules) that 
specify how view metadata values will be computed from source metadata, as 
well as rules that specify how the source metadata should change in response 
to changes in the view metadata (called “downward rules”). 

• Use these rules to answer user queries about the metadata, and to keep 
metadata values consistent between the two tiers. 



Our intention is for the framework to be an aid to data administration. The rules 
chosen by the framework encode the semantics of the view, and thus ultimately 
require human assistance and verification. The data administrator should always 
be able to override suggested rules or add new rules. Although our discussion 
focuses on metadata, one also wants rules that propagate operations and events. 

The following subsections flesh out this approach. Section 3.1 describes a 2-tier 
database for our running example. Section 3.2 examines how the framework 
determines the kinds of metadata in a view. Section 3.3 considers rules, and how 
they are used to compute metadata. 
3.1 A Running Example

We shall illustrate the details of our framework using the following example. The 
source-tier schema contains information about aircraft, their positions, and their 
scheduled flights. The key of each table is its first attribute (A_Id for AIRCRAFT
and POSITION, F_Id for FLIGHT); FLIGHT.A_Id is a foreign key reference.

AIRCRAFT(A_Id, Type, Capacity) 

POSITION(A_Id, LatLong, Speed_Mph, Height) 

FLIGHT(F_Id, A_Id, StartAirfield, EndAirfield, FuelNeeded) 

The view-tier schema contains three tables, defined as follows. 

Define view POSITION_DE_AVION as
Select AvionNom=A_Id, JeSuisIci=LatLong, Vitesse_Kph=Speed_Mph*1.6
From POSITION

Define view SCHEDULE as
Select A.*, F.F_Id, F.Fuel
From AIRCRAFT A, FLIGHT F
Where A.A_Id = F.A_Id

Define view AIRFIELD_FUEL_NEEDS as
Select Airfield=StartAirfield, TotalFuel=SUM(Fuel)
From FLIGHT
Group by StartAirfield

The view POSITION_DE_AVION renames POSITION attributes to French (or
“franglais”) and converts Speed_Mph to kilometers per hour. The view SCHEDULE joins
FLIGHT and AIRCRAFT, while AIRFIELD_FUEL_NEEDS computes the total 
amount of fuel needed for each airfield’s flights. 

3.2 Granules and their Properties 

Metadata on granules 

A granule is an identifiable subset of a database table to which metadata or 
methods can be attached, e.g., a table, column, row, cell value, or view. Each 
piece of metadata associated with a granule is given a name, called a property. 

In our example, the granule POSITION might have a property Authorizations, 
describing the users authorized to access the table; the granule 
POSITION.Speed_Mph might in addition have the properties Credibility and
AbsoluteErrorBound, denoting the fact that each data value in the column has the 
same credibility rating and error bound. Note that for the above examples, a value 
attached to a table or column granule describes each cell within it; other metadata 
types could describe properties more global to the granule. Values for specific 
granules override the wider-scope value. 
We simplify our presentation by assuming that all property values are stored in one
global metadata table that has three columns: Granule, Property, and Value. Thus 
one of the above properties might be represented in the metadata table as the row 

(POSITION.Speed_Mph, Credibility, “Gary says 0.6”)

In practice, this global table is likely to be a view that draws data from user data 
tables, CASE tool metadata tables, and system catalogs. However, a fuller 
treatment here would focus attention on irrelevant technical details (e.g., resolving 
inheritance and overriding, rules to propagate meta-metadata, and naming 
conventions for granules). 
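
As a minimal sketch of the single global metadata table assumed above, the three-column layout can be held in an in-memory SQLite database and queried with ordinary SQL. The table name metadata, the helper properties_of, and every value other than the Credibility row are illustrative assumptions, not part of the framework.

```python
import sqlite3

# Hypothetical three-column layout following the text: Granule, Property, Value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (granule TEXT, property TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        ("POSITION", "Authorizations", "ops_staff"),             # illustrative value
        ("POSITION.Speed_Mph", "Credibility", "Gary says 0.6"),  # row from the text
        ("POSITION.Speed_Mph", "AbsoluteErrorBound", "5"),       # illustrative value
    ],
)


def properties_of(granule):
    """Return {property: value} for the rows attached directly to one granule."""
    rows = conn.execute(
        "SELECT property, value FROM metadata WHERE granule = ? ORDER BY property",
        (granule,),
    ).fetchall()
    return dict(rows)


speed_props = properties_of("POSITION.Speed_Mph")
print(speed_props)
```

The sketch deliberately ignores inheritance from wider-scope granules (for example, a table-level value applying to each column), which the text sets aside as a technical detail.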



Derivation Queries and Derivation Trees 

To generate candidate metadata for a view granule, we first gather the metadata on 
relevant source granules. The task of identifying these granules is fairly 
straightforward. The basic idea is to exclude data and computation that are
irrelevant to the view granule. For a granule g of a view V, the initial derivation
query (denoted idq(g,V)) is defined as follows: 

• If g is an attribute A (i.e., column), use “Select V.A From V” 

• If g is a cell, attribute A of row r, use “Select V.A From V where row_id = r” 

• If g is the entire view, use “Select * from V” 

Any query equivalent to idq(g,V) is called a derivation query, and denoted dq(g,V). 
Query simplifications may be performed (e.g., replace V by its defining query, 
exploit integrity constraints). Algorithms developed for monitoring changes to 
views would seem to apply here, though we have not yet made specific 
connections. In particular, pushing projections down allows us to ignore metadata 
on attributes that are immediately projected away; selections allow us to ignore 
metadata on irrelevant cells. 
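
The three cases of the initial derivation query can be sketched as a small function that emits the query text. The function name idq mirrors the notation above; the kind argument and its string values are hypothetical conveniences, and the simplification step (expanding V, pushing projections down) is not attempted here.

```python
def idq(view, kind, attribute=None, row_id=None):
    """Build the text of the initial derivation query for a granule of `view`.

    kind is one of "attribute", "cell", or "view" (names assumed for this sketch).
    """
    if kind == "attribute":
        return f"Select {view}.{attribute} From {view}"
    if kind == "cell":
        return f"Select {view}.{attribute} From {view} where row_id = {row_id}"
    if kind == "view":
        return f"Select * from {view}"
    raise ValueError(f"unsupported granule kind: {kind}")


print(idq("POSITION_DE_AVION", "attribute", attribute="Vitesse_Kph"))
print(idq("SCHEDULE", "cell", attribute="Fuel", row_id=7))
print(idq("AIRFIELD_FUEL_NEEDS", "view"))
```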

For example, consider the views defined at the beginning of this section. Queries 
will be shown as trees, in relational algebra, and tree terminology will be used 
freely (e.g., calling the inputs of dq(g,V) the leaves of a derivation tree). The 
derivation tree for POSITION_DE_AVION.Vitesse_Kph appears in Figure 2a; the 
derivation tree for SCHEDULE.A_Id appears in Figure 2b.

The derivation tree for a granule may be much simpler than the underlying view 
query. When SCHEDULE is projected solely on FLIGHT attributes, the join with
AIRCRAFT is irrelevant (since the foreign key constraint implies that each 
FLIGHT matches exactly one AIRCRAFT). Hence the derivation query for 
SCHEDULE.Fuel (see Figure 2c) contains only the node for FLIGHT.Fuel. This 
sort of reasoning would require skill and care from a data administrator, but is easy 
to automate. 
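
One way this reasoning might be automated is sketched below for the SCHEDULE view only. The dictionaries and the function relevant_sources are hypothetical. The sketch assumes, as the text does, that every FLIGHT row matches exactly one AIRCRAFT row, so the referenced table can be dropped whenever none of its attributes is projected.

```python
# Which source table supplies each SCHEDULE attribute (per the view definition).
SCHEDULE_ATTRS = {
    "A_Id": "AIRCRAFT", "Type": "AIRCRAFT", "Capacity": "AIRCRAFT",
    "F_Id": "FLIGHT", "Fuel": "FLIGHT",
}
# (referencing table, referenced table): FLIGHT.A_Id references AIRCRAFT.A_Id.
FOREIGN_KEYS = {("FLIGHT", "AIRCRAFT")}


def relevant_sources(projected_attrs):
    """Return the source tables that remain after foreign-key join elimination."""
    needed = {SCHEDULE_ATTRS[a] for a in projected_attrs}
    kept = set(SCHEDULE_ATTRS.values())
    for referencing, referenced in FOREIGN_KEYS:
        # The referenced side neither filters nor duplicates the referencing rows,
        # so it may be dropped when none of its attributes is requested.
        if referenced not in needed and referencing in kept:
            kept.discard(referenced)
    return kept


print(relevant_sources(["Fuel"]))      # only FLIGHT remains
print(relevant_sources(["Capacity"]))  # both tables remain
```

The join cannot be dropped in the other direction: FLIGHT determines how many times each AIRCRAFT row appears, so it stays relevant even when only AIRCRAFT attributes are projected.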
Figure 2: Example Derivation Trees (panels a, b, and c)



Suggested Granule Properties 

When the initial derivation query for a view granule g is simplified, what remains 
are the source granules necessary to compute g. Thus it is not unreasonable to 
assume that the properties of the source granules will “filter up” to g. 

We express this intuition by defining the suggested set of properties for view
granule g as {P | P is a property of a granule in a leaf of dq(g, V)}.

The data administrator is free to omit some suggested properties and add others. In 
fact, the non-presence of a particular property in the view might be cause for a 
fruitful negotiation between the administrators of the two tiers. 
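
The suggested set is simply a union over the leaves of the derivation query. A minimal sketch follows; the metadata rows, the leaf set, and the names suggested_properties, omitted, and added are illustrative assumptions.

```python
# Illustrative (granule, property, value) rows; only the first comes from the text.
METADATA = [
    ("POSITION.Speed_Mph", "Credibility", "Gary says 0.6"),
    ("POSITION.Speed_Mph", "AbsoluteErrorBound", 5),
    ("POSITION.LatLong", "Credibility", "0.9"),
    ("FLIGHT.Fuel", "Pedigree", "fuel desk"),
]


def suggested_properties(leaf_granules, metadata):
    """Return {P | P is a property of a granule in a leaf of dq(g, V)}."""
    return {prop for granule, prop, _value in metadata if granule in leaf_granules}


# Assumed leaf set for POSITION_DE_AVION.Vitesse_Kph after simplification.
suggested = suggested_properties({"POSITION.Speed_Mph"}, METADATA)

# The administrator may omit some suggested properties and add others.
omitted, added = {"AbsoluteErrorBound"}, {"LastReviewed"}
final = (suggested - omitted) | added
print(sorted(suggested), sorted(final))
```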



3.3 Propagation Rules and Property Value Computation 

Given a view granule g, the values of its properties will be calculated by assigning 
propagation rules (or just rules) to the operators in g’s simplified derivation tree. 
Informally, each rule specifies a function that propagates metadata up (or down) 
the derivation tree. This section describes the structure and administration of these 
rules, as well as the mechanism for using rules to calculate property values of view 
granules. 

Rules 

A rule has four components: a direction, a computation, a scope, and a strength. 

We discuss each component in turn, and then present examples. 

The direction of a rule is either “upward” or “downward”. An upward rule uses 
the values of the leaf granules to calculate a property value for the result. A 
downward rule uses the property value for a (single) view granule to calculate 
values for the leaf nodes. A rule specified as “both” is shorthand for two rules, one 
for each direction. 
The computation of a rule is a function that determines how the output(s) will be
calculated from the input(s). 

The scope of a rule specifies when the rule is applicable to a granule. The scope 
specification makes it possible to define rules at the most general possible level, so 
as to promote sharing and better abstraction. In our scope, we can specify the 
tables, attributes, properties, and the atomic query operators to which the rule 
applies. 

The strength of a rule specifies the extent to which the system should automatically 
use it. We currently have three possible values. A definitive rule is to be applied 
automatically, without user interaction. A default rule is to be preferred by the 
system, but is subject to user verification. And a candidate rule is one of perhaps 
several equal possibilities to be presented to the user. More sophistication is clearly 
possible (e.g., overriding, removing candidates, dependence on user,
type-checking, etc.). However, we chose to leave such enhancements until we have
better experience with the simpler scopes. 
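
The four components could be represented as a small record type. The sketch below is one possible encoding, not a prescribed one: the class names Scope and Rule, the helpers expand_both and needs_user_input, and the use of the string "ALL" as a wildcard are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable

STRENGTHS = ("definitive", "default", "candidate")


@dataclass(frozen=True)
class Scope:
    operation: str
    tables: str = "ALL"
    atts: str = "ALL"
    properties: str = "ALL"

    def matches(self, operation, table, att, prop):
        """True when the rule applies to this operator, table, attribute, property."""
        return (
            self.operation == operation
            and self.tables in ("ALL", table)
            and self.atts in ("ALL", att)
            and self.properties in ("ALL", prop)
        )


@dataclass(frozen=True)
class Rule:
    name: str
    direction: str         # "upward" or "downward"
    computation: Callable  # maps input value(s) to output value(s)
    scope: Scope
    strength: str          # one of STRENGTHS

    def __post_init__(self):
        if self.direction not in ("upward", "downward"):
            raise ValueError(f"bad direction: {self.direction}")
        if self.strength not in STRENGTHS:
            raise ValueError(f"bad strength: {self.strength}")


def expand_both(name, up, down, scope, strength):
    """A rule specified as "both" is shorthand for one rule per direction."""
    return [
        Rule(f"{name}_up", "upward", up, scope, strength),
        Rule(f"{name}_down", "downward", down, scope, strength),
    ]


def needs_user_input(rule):
    """Only definitive rules are applied without user interaction."""
    return rule.strength != "definitive"


identity = lambda value: value
rename_scope = Scope(operation="RENAME")
rules = expand_both("rename_identity", identity, identity, rename_scope, "definitive")
print([r.direction for r in rules], needs_user_input(rules[0]))
```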

Examples of Rules 

The most generally applicable metadata-derivation rule is “do nothing” - that is, to 
pass the property value unchanged up (or down) the derivation query. This rule 
seems always applicable to the RENAME operator, and often applicable to 
MULTIPLY. (The property Credibility can pass through MULTIPLY unchanged,
but AbsoluteErrorBound must scale proportionately.) The following rules capture 
these observations: 

R1 Direction: both
   Computation: output = input
   Scope: operation=RENAME, tables=ALL, atts=ALL, properties=ALL
   Strength: definitive

R2 Direction: both
   Computation: output = input
   Scope: operation=MULTIPLY(c), tables=ALL, atts=ALL, properties=Credibility
   Strength: definitive

R3 Direction: upward
   Computation: output = input * c
   Scope: operation=MULTIPLY(c), tables=ALL, atts=ALL, properties=AbsoluteErrorBound
   Strength: definitive

R4 Direction: downward
   Computation: output = input / c
   Scope: operation=MULTIPLY(c), tables=ALL, atts=ALL, properties=AbsoluteErrorBound
   Strength: definitive
There are many other rules of narrower scope. We would hope that vendors and
even user organizations would incrementally add these rules to their systems. For 
example, for the SUM aggregation operator and the AbsoluteErrorBound property, 
a rule (with strength “default”) might multiply the input by the number of values 
being aggregated. 
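
A sketch of such a SUM rule, applied to AIRFIELD_FUEL_NEEDS, is given below. It assumes a single column-level bound b on FLIGHT.Fuel, so that a total of n values is off by at most n * b (each term contributes at most b). The bound of 50 and the three flights are invented for illustration.

```python
from collections import Counter

COLUMN_BOUND = 50.0                      # assumed AbsoluteErrorBound on FLIGHT.Fuel
START_AIRFIELDS = ["BOS", "BOS", "JFK"]  # StartAirfield of three illustrative flights


def sum_error_bounds(start_airfields, column_bound):
    """Bound on TotalFuel per airfield: input bound times number of values summed."""
    counts = Counter(start_airfields)
    return {airfield: column_bound * n for airfield, n in counts.items()}


total_fuel_bounds = sum_error_bounds(START_AIRFIELDS, COLUMN_BOUND)
print(total_fuel_bounds)
```

Because this is a worst-case bound, an organization might prefer a tighter statistical estimate, which is one reason to give the rule strength “default” rather than “definitive”.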

Suppose we have probabilistic estimates of an attribute’s correctness (here, defined 
as the probability of being exactly right) and availability (for fault tolerance, the 
probability of receiving a response from the server that stores the information). 
Then to calculate the correctness and availability properties, one might multiply 
the values from the input properties, with strength “candidate”. 
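
The multiplication can be sketched in a few lines. Multiplying probabilities is exact only when the inputs are statistically independent; that independence is an assumption of this sketch, and plausibly one reason such a rule would be offered only as a candidate. The numeric values are illustrative.

```python
import math


def combine_probabilities(input_values):
    """Multiply per-input probabilities (assumes the inputs are independent)."""
    for p in input_values:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"not a probability: {p}")
    return math.prod(input_values)


# Illustrative: a derived value computed from two source attributes.
correctness = combine_probabilities([0.9, 0.8])
availability = combine_probabilities([0.99, 0.95])
print(correctness, availability)
```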

Other rules might perform more subtle analyses. We create a property type 
Pedigree to capture how each input to a granule’s derivation affects the granule’s 
value. Consider the view SCHEDULE, obtained by joining AIRCRAFT and 
FLIGHT on foreign key A_Id. Because of the foreign key constraint, only
FLIGHT.Fuel influences SCHEDULE.Fuel. But the pedigree of
SCHEDULE.Capacity is more complex. AIRCRAFT.Capacity determines the
value, but since an aircraft could have an arbitrary number of flights, both
AIRCRAFT.A_Id and FLIGHT.F_Id influence which tuples were present, and the
number of duplicates.

As a final example, when an operator combines two textual or Boolean fields, the 
result’s Credibility might be set to the minimum (or product) of the input values (if 
purely numeric), or one might concatenate the textual discussions of credibility. 
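
These alternatives might be sketched as a single computation that dispatches on the kind of value it receives. The function name, the numeric_mode parameter, and the "; " separator are hypothetical choices for this illustration.

```python
def combine_credibility(left, right, numeric_mode="min"):
    """Combine two Credibility values: min or product if numeric, else concatenate."""
    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if is_number(left) and is_number(right):
        return min(left, right) if numeric_mode == "min" else left * right
    # At least one value is a textual discussion of credibility.
    return f"{left}; {right}"


print(combine_credibility(0.6, 0.9))
print(combine_credibility(0.6, 0.9, numeric_mode="product"))
print(combine_credibility("Gary says 0.6", 0.9))
```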

Invoking Propagation Rules 

Given property P of view granule g, its value is determined as follows. The 
derivation query dq(g, V) is calculated. For every operator in the tree, an 
applicable upward rule is chosen. The computations of the rules are then 
composed to compute the value of the root node, which becomes the value of P. 

For example, consider the view granule POSITION_DE_AVION.Vitesse_Kph and
its property Credibility. The derivation query tree for this granule was given in
Figure 2a. We therefore need to choose an applicable rule for each of the two
operators of the tree. Using the rules defined above, we see that R1 is the
applicable rule for the RENAME operator, and R2 is the applicable rule for
MULTIPLY. As both rules have the identity function as their computation, the
result is that the value of Credibility is the same for this granule as for the granule
POSITION.Speed_Mph.

Now consider the property AbsoluteErrorBound for the same view granule. We 
must choose rules for the same derivation tree applicable to this property. The 
applicable rules are R1 for RENAME, and R3 for MULTIPLY. The result is that 
the value for the property will be 1.6 times the value of AbsoluteErrorBound for 
POSITION.Speed_Mph.
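
The two computations just described can be traced with a short sketch that composes the upward computations of R1, R2, and R3 along the derivation tree. The lookup table, the leaf-to-root operator order (the figure is not reproduced here), and the source bound of 5 are assumptions made for illustration.

```python
C = 1.6  # conversion factor from the view definition

# (operator, property) -> upward computation, mirroring rules R1 to R3.
UPWARD_RULES = {
    ("RENAME", "Credibility"): lambda v: v,               # R1
    ("RENAME", "AbsoluteErrorBound"): lambda v: v,        # R1
    ("MULTIPLY", "Credibility"): lambda v: v,             # R2
    ("MULTIPLY", "AbsoluteErrorBound"): lambda v: v * C,  # R3
}

# Operators of the derivation tree for Vitesse_Kph, from the leaf upward (assumed order).
DERIVATION_PATH = ["MULTIPLY", "RENAME"]


def propagate_up(prop, source_value):
    """Compose the chosen upward rules to obtain the view granule's property value."""
    value = source_value
    for operator in DERIVATION_PATH:
        value = UPWARD_RULES[(operator, prop)](value)
    return value


view_credibility = propagate_up("Credibility", "Gary says 0.6")
view_bound = propagate_up("AbsoluteErrorBound", 5.0)  # illustrative source bound
print(view_credibility, view_bound)
```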
In addition to selecting upward rules for each view property, downward rules can
also be selected (either automatically, or with assistance from the view creator).
By doing so, the view creator links the metadata at both the source and the view, so
that changes at one tier can be propagated to the other.
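
As an illustration of the downward direction, suppose a manager states a desired AbsoluteErrorBound on Vitesse_Kph. Rule R4 translates it into a requirement on POSITION.Speed_Mph, and R3 maps it back. The function names and the desired bound of 4.0 are hypothetical.

```python
C = 1.6  # conversion factor from the view definition


def upward_bound(source_bound):
    """R3: AbsoluteErrorBound passes upward through MULTIPLY(c) as input * c."""
    return source_bound * C


def downward_bound(view_bound):
    """R4: AbsoluteErrorBound passes downward through MULTIPLY(c) as input / c."""
    return view_bound / C


desired_view_bound = 4.0  # illustrative requirement stated on Vitesse_Kph
required_source_bound = downward_bound(desired_view_bound)
round_trip = upward_bound(required_source_bound)
print(required_source_bound, round_trip)
```

Pairing R3 with R4 keeps the two tiers consistent: propagating a value down and then up again returns the original value, up to floating-point rounding.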

Administering Rules, within a Component Framework

Managing the rule set includes creating and modifying rules, inspecting what rules
apply, overriding or removing rules inherited from a wider scope, and selecting 
one of the candidate rules. The system should provide tools for performing all 
these tasks. Vendors, professional administrators, and power users need many of 
the same capabilities, so the tools should be part of the delivered system. 

Rule choice may depend on the derivation query’s logic, domain semantics, and 
organizational policy. Organizations could contribute domain-specific types and
rules, and database administrators could add database-specific rules and
override existing ones. Some specifications might be supplied when a new
property or new derivation operator is defined. Others might be created when 
defining a view to serve a particular community or application. Simple tasks might 
be left to run-time users (e.g., confirming defaults, choosing among candidates). 

It is impossible to provide appropriate rules for all properties, through all possible 
atomic query operators (both SQL and user-defined), for all organizations. A 
vendor of a propagation system could provide an initial set of useful rules. But as 
needs expand, both vendors and their customers will need to extend and customize 
the rule base. Thus, the system should be componentized, i.e., should allow simple, 
independent steps to extend the operators, properties, and rules. 

An important aspect of our framework proposal is that it is a component framework 
for propagation rules, i.e., standards and services that enable separately-provided 
components to work together. The system framework would maintain a database 
of rules that is available to all tiers, and interfaces for inspecting, defining, 
modifying, and overriding rules. The framework also provides the facilities for rule 
invocation. Finally, to reduce semantic heterogeneity, a framework must define a
set of fundamental properties (e.g., Credibility) and view-derivation operators (e.g.,
Select, Outerjoin) that all tiers would be encouraged to use.



4 DISCUSSION, SUMMARY AND FUTURE WORK 

To manage integrity in a multi-tier database, we must propagate integrity metadata
and events among the tiers. We have tried to illustrate several points:

• In multi-tier systems, it is essential to propagate ancillary metadata. For each 
metadata or event type, one may want propagation options that are customized 
for particular databases, tables, columns, cell values or other groupings. Since 
every attribute (and many other granules) may be associated with several 
pieces of ancillary information, automated assistance is essential. 

• A framework can be constructed to help componentize propagation
capabilities, enabling rules and knowledge to be supplied incrementally. The
framework would be employed at a variety of skill levels, e.g., to write new
rules, to select appropriately from existing ones, or simply to execute a rule to
see metadata from other tiers.

• While the general problem of “first class” views is notoriously hard, the goal 
of providing assistance is attainable. By offering multiple candidate rules, we 
help administrators handle cases where no single rule applies universally. A 
small collection of heuristics, plus knowledge of query operator semantics can 
handle many views. 

• Propagation rules for complex queries can be composed from propagation 
rules of constituent operators, many of which will be simple. Propagating
events may involve actions outside the database (e.g., “forward this request via
email”).

Our project (Managing Risk in the Data Warehouse) aims to provide the 
framework and simple components that handle some of the easy cases. More 
complex components (e.g., for complex derivation operators) would then be 
plugged in as researchers or vendors produced them. For example, research on data 
quality measures might lead to a component that was expert in transforms of 
precision metadata. 

To illustrate the intended usage, desired capabilities, user roles, and technical 
feasibility, we have developed a demonstration vehicle (a series of screens, without 
real underlying code). The vehicle has helped us identify opportunities and 
difficulties. We also are using it to try to persuade tool vendors to add such 
capabilities to their products. 

There are many challenges here for database researchers. There is no established 
propagation technology for most properties, operations and events. This is not 
surprising for little-studied issues like data quality, but it even applies to simple 
corrections. Potential research areas include view updates after the source has 
changed (e.g., for periodically refreshed materialized views), bulk corrections (i.e., 
translating SQL Update statements), propagation options for additional query 
operators (e.g., propagating error information through views (Kon, 1996)), 
administration of expressions composed of multiple operators, and passing 
constraint information through views (realizing that part of the constraint may not 
be expressible at the other tier, and few users can understand complex formulas). 

Two broad challenges are critical to the success of this approach. First, vendors 
need to implement and perfect the framework specifications and services. Second, 
because multi-tier systems often span organizations, we need to borrow and use 
well-known ontologies for metadata and operation, both from consortia (World
Wide Web Consortium, Metadata Consortium) and from disciplinary bodies (e.g.,
Dublin Core, or geospatial metadata standards). 
Abstract

Active database rules have been identified as a useful technology for integrity 
constraint maintenance in centralized database systems. Maintaining con- 
straints in a distributed environment such as that of a multidatabase system 
provides an even more challenging task for active rule technology. This pa- 
per presents the notion of distributed active rules for constraint processing in 
a distributed environment, together with an architecture for the use of such 
rules. The specification of distributed active rules is based on the statement of 
constraints that exist between heterogeneous database sources. The condition 
of a distributed rule must provide for an efficient means to check both local 
and remote constraint conditions. We present the structure of distributed
active rules and provide an execution semantics for such rules. We also describe
an architecture for communication between local and global rule processors.
Finally, we discuss future research issues associated with the analysis of dis- 
tributed constraints and the generation of distributed active rules. 

Keywords 

Integrity Maintenance, Multidatabases, Distributed Active Rules 



1 INTRODUCTION 

An important challenge in modern information systems is the integration of 
heterogeneous and autonomous data systems. The integration of such compo- 
nents is commonly known as a multidatabase system (Elmargamid et al. 1990). 
As part of that challenge, a significant problem to be addressed is consistency 
maintenance among distributed database components. In centralized and dis- 
tributed systems, integrity constraints are often implemented directly in the 
application code. As a result, errors and omissions in the checking and
maintenance of constraints can easily be introduced. Furthermore, if guaranteeing
database consistency in a single database is a difficult problem, data commu- 
nication issues, heterogeneity and autonomy make constraint maintenance in 
a multidatabase environment an even more complex task. 

Decoupling applications from consistency maintenance is referred to as 
knowledge independence in (Baralis et al. 1994). To support knowledge in- 
dependence, active database technology (Widom et al. 1996) can be used 
together with a constraint specification language to provide automated re- 
sponses to constraint violating operations. Active database technology has 
primarily been investigated in the context of centralized systems, making use 
of Event-Condition-Action rules to provide reactive behavior. For example,
when an event occurs that may potentially violate a constraint, a condition 
is evaluated to test for constraint satisfaction. If the constraint is violated, an 
action is triggered to repair the constraint violation. 
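
The event, condition and action steps just described can be sketched in a few lines of Python. This is only an illustration, not the rule language of this paper: the class `ECARule`, the function `signal`, the event name `modify_hours` and the 8-hour limit used here are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ECARule:
    """Hypothetical Event-Condition-Action rule; all names are illustrative."""
    event: str                           # name of the triggering event
    condition: Callable[[Dict], bool]    # True means the constraint is violated
    action: Callable[[Dict], None]       # repairs the violation


def signal(event: str, state: Dict, rules: List[ECARule]) -> int:
    """Run every rule registered for `event`; return how many actions fired."""
    fired = 0
    for rule in rules:
        if rule.event == event and rule.condition(state):
            rule.action(state)
            fired += 1
    return fired


def cap_hours(state: Dict) -> None:
    """Repairing action: bring the value back inside the allowed range."""
    state["hours"] = 8


# Assumed constraint for the sketch: hours must not exceed 8.
rules = [ECARule("modify_hours", lambda s: s["hours"] > 8, cap_hours)]

violating = {"hours": 11}
consistent = {"hours": 6}
fired_on_violation = signal("modify_hours", violating, rules)
fired_on_consistent = signal("modify_hours", consistent, rules)
```

The condition is written as a violation test, so the action runs only when the test returns true, which matches the convention used for rule conditions later in the paper.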

This paper presents the concept of distributed active rules for constraint 
maintenance in a multidatabase environment, together with an architecture 
for the execution of such rules. The specification of distributed active rules 
is based on the statement of constraints that exist between heterogeneous 
database sources, where constraints can either be private global constraints 
or public global constraints. In the case of private global constraints, a
local database expresses a constraint that must be maintained locally based
on conditions that exist in remote databases. Public global constraints are 
constraints that must be maintained globally through cooperation among all 
databases involved in the constraint. Constraints in this environment are
expressed using the Multidatabase Constraint Specification Language (MCSL)
(Gomez et al. 1997), which is based on the ODMG 2.0 standard (Cattel 1994)
for the expression of constraint query conditions. 

Distributed active rules are developed based on the statement of MCSL 
constraints. The specification of distributed active rules must be concerned 
with checking both local and remote conditions for the purpose of detecting 
constraint violations. Furthermore, distributed rules stored at a local site can 
be triggered by events that occur at remote sites. In this paper, we describe 
the structure of distributed rules and the manner in which rule conditions are 
organized into local and remote conditions. We also describe the architecture 
of the environment, including rule processing components that must exist at 
each local database and the global rule processing components that must be 
constructed to support distributed execution of remote rule conditions. 

The contribution of the work presented in this paper lies in the extensions 
that we have defined for transforming event-condition-action rules into rules 
that function over distributed data. The use of distributed active rules sup- 
ports the definition of non-trivial constraints between heterogeneous database 
sources and provides a viable mechanism for communication between databases 
in the checking and maintenance of such constraints. Through the use of such 
rules, active database technology can therefore be extended into distributed 
domains, where the databases involved in the multidatabase environment are
autonomous and otherwise passive database systems. 

The remainder of this paper is organized as follows. Related work on
constraint specification languages and constraint maintenance in distributed
environments is first presented in Section 2. In Section 3, we describe the
multidatabase environment and the type of constraints supported. Section 4
describes the distributed active rule language. Rule execution semantics and the
architecture of the distributed active multidatabase system are presented in
Section 5. The paper concludes in Section 6 with future research directions.



2 RELATED WORK 

Research on the use of active databases for constraint maintenance has
received substantial attention in recent years (Widom et al. 1996). Most of this
work has been carried out in the context of centralized systems
and does not consider rules and constraints in distributed or heterogeneous 
database systems. In this section, we address research related to integrity 
constraints and distributed active databases. 

Constraint management in distributed environments initially focused only 
on tightly coupled systems (Simon et al. 1986). Later, new approaches were 
proposed for loosely coupled environments in which distributed transactions 
are not available. One approach to constraint maintenance in a multidatabase 
is based on the concept of data dependencies. A Data Dependency Descriptor 
model which includes consistency predicates and restoration procedures is 
proposed in (Rusinkiewicz et al. 1991). Another approach described in (Ceri 
et al. 1993) uses active rules and persistent queues to maintain the consistency 
of existence and value dependencies between relational databases. 

Protocols are used in (Grefen 1994) for integrity constraint checking in fed- 
erated databases. The basic protocol detects an update, raises an alarm when 
a violation is detected, and notifies the Constraint Manager. Protocols vary in 
terms of requirements of the underlying systems, level of asynchronous
communication, flexibility and execution cost. Not all the protocols proposed are
accurate, meaning that they can produce ‘false alarms’ (notification when a 
violation did not occur). Repairing actions in this approach are not addressed. 

(Chawathe et al. 1996) suggest a formal approach for constraint manage- 
ment in loosely coupled distributed databases where locking and transaction 
primitives may not be available. Weaker notions of constraint maintenance 
are formalized and an event-based formal framework is introduced. 

The constraint approach in (Grufman et al. 1997) describes the integration 
of a functional database and an active object system to enforce integrity across 
a multidatabase. The systems are integrated using a tightly-coupled approach
where the global schema is maintained by the functional database. Constraints 
are limited to universally quantified variables over a simple conjunction of 
predicates. 
Optimization techniques for distributed constraint checking to avoid remote
database access have also been investigated. In (Barbara et al. 1992), the 
Demarcation protocol is used to maintain simple arithmetic constraints. The 
work in (Gupta 1993) suggests a method to generate local tests to test global 
integrity constraints. 

Issues related to the use of active rules in a distributed environment have 
only recently been investigated. Rule decomposition, rule distribution, and 
correct evaluation of distributed rules in a distributed active database are
analyzed in (Hsu et al. 1992). This research is performed in the context of 
relational databases where relations are partitioned horizontally and/or verti- 
cally and segments are distributed among sites. Rule queries are decomposed 
using algebraic manipulations based on principles of query optimization. Con- 
dition evaluation can be done in a distributed fashion, but rule processing in 
general is still centralized. 

In (Ceri et al. 1992), a locking scheme and a rule-task executor that allow
rules to reference data at multiple sites are described. To coordinate sites and
support rule priorities, additional locking and communication protocols are 
proposed. One limitation of the approach is that tables cannot be replicated 
or fragmented across sites. 

In (Pissinou et al. 1996), a reactive multidatabase architecture that permits 
the explicit specification, recognition and resolution of temporal changes to 
support interoperability of objects over time is proposed. Our architecture 
differs from (Pissinou et al. 1996) in that rule events, conditions and actions 
are decoupled and can be executed at different sites. 

The difference between our work and the research presented above is that 
we address remote rule condition testing and remote action execution within 
an object-oriented approach to the expression of constraints and rules. Active 
Rules are structured to use optimization techniques that minimize commu- 
nication with remote sites. Furthermore, we focus on non-trivial inter-object 
constraints that involve components stored in different databases. 



3 THE MULTIDATABASE ENVIRONMENT 

This section presents the general framework of the multidatabase environment 
that we are assuming for this research. Section 3.1 describes the architecture 
of the environment together with an Airline application that will be used as a 
running example in the rest of the paper. Section 3.2 then describes the types 
of constraints that are maintained through the use of distributed active rules. 



3.1 The Multidatabase environment 

We are assuming a loosely-coupled, federated database system in which there
is no global schema. Each database in the federation may have a different
data model and different database capabilities. It is the responsibility of the
multidatabase administrator to resolve any differences between database com- 
ponents. Although each database component is autonomous, some of the 
databases may require access to data from remote components of the fed- 
eration to enforce application rules. 




Figure 1 Multidatabase Architecture 

To integrate information from different databases, each database provides 
an object-oriented sub-schema based on the ODMG model (Cattel 1994). Each 
database in the multidatabase environment can access information from the 
sub-schemas provided by other databases in the federation. To access data 
in a remote database, a local database will import an ODMG definition of 
the schema provided by the remote database. The imported schema, on the
other hand, is viewed as the export schema of the remote database. Also, a
single database can export different sub-schemas to different databases. For
example, Figure 1 illustrates a multidatabase composed of four databases.
Database A exports sub-schema A1 to remote database B and sub-schema A2
to remote databases C and D. Database A also imports sub-schemas C1 from
database C and D1 from database D. Not all databases need to export data.
Sometimes a database only needs access to remote data such as in the case of 
database B. 

To support our discussion of distributed constraints and rules, we introduce 
a small airline example. Assume that AirFun, LittleAir, and ControlAir are
three independent enterprises that maintain their own database applications.
These three companies need to coordinate and share information. ControlAir
is a regulation agency that maintains global information about crew members,
cities and airlines. ControlAir also maintains statistics about accidents, the
flight history of pilots and other information that may be used for any
airline. AirFun is an airline that provides passenger transportation and cargo
services. This airline keeps information about flights, planes, and packages
shipped. LittleAir is a small airline that only offers passenger transportation.
LittleAir does not have crew staff and has to contract the services of the
AirFun airline. The database of LittleAir only keeps information about flights.
Flights from AirFun or LittleAir can be assigned to any crew member who is
registered by the regulation agency.




Figure 2 Exported schemas for Airline Example 

The export schemas of AirFun, LittleAir and ControlAir are shown in
Figure 2. In the graphical notation, each abstract object is represented inside a
box. The upper part contains the name of the object, the middle part
contains all simple properties, and the lower part contains all derived attributes
and methods. Relationships are represented by labeled arrows between the
abstract objects. Single arrows and double arrows represent single-valued and
multi-valued properties, respectively. Bold unlabeled arrows represent ISA
relationships.



3.2 Multidatabase Constraints 

In the multidatabase environment introduced in the previous section, indi- 
vidual databases collaborate in the exchange of information. External control 
from other databases, however, is limited. In this environment, we identify 
two different forms of distributed constraints: Private Global constraints and 
Public Global constraints. 
Private Global Constraints (PvGC’s) define dependencies between data
that is stored in more than one database. PvGC’s are considered private
because the remote database is not aware of the constraint. The remote database
allows limited access to some of its information through an export schema. To
define a PvGC, it is possible to use any component of the internal database
schema of the local database as well as any component of the remote schemas
that are being imported to the local database. Since remote databases are
not aware of the constraint, the local database cannot generate actions that
may alter the remote databases when a PvGC violation is detected. However,
the local database can be informed when a change in the remote database
has occurred. The local database can then take some local action to fix the
violation. Consider the following example of a PvGC:

Pilots in LittleAir can only fly on the planes for which they successfully
completed the required training by ControlAir.

If a license to fly plane ‘B777’ is revoked for a pilot in the ControlAir database,
a notification is generated to remote database LittleAir. If the constraint is
violated, LittleAir does not have the authorization to abort the transaction
in ControlAir. It can, however, generate a local action to eliminate the pilot
from all the flights that have plane ‘B777’ assigned.

Public Global Constraints (PbGC’s) are constraints that are associated
with entities in more than one database. All participant databases in the
federation agree with the definition of this kind of constraint. The
specification of PbGC’s is done in the context of the exported schemas. The internal
schemas of each database are not available for the federation. Each of the
individual databases involved in the specific constraint collaborates during
the constraint maintenance process. In the local database in which the event
is triggered, if the global constraint is violated a local action is executed to
repair the violation. An example of a public global constraint is:

For safety reasons imposed by the regulation agency, pilots and flight
attendants cannot fly more than 8 hours in the same day.

Private and Public global constraints are expressed in a high level
declarative language called the Multidatabase Constraint Specification Language
(MCSL). MCSL provides a syntax based on ODMG OQL that allows easier
expression of complex constraints between databases. Consider the constraint
example illustrated in Figure 3. This safety constraint imposed by the
regulation agency indicates that crew members cannot fly more than 8 hours in
the same day.

The ForAll section supports the declaration of object variables used in the
expression of the constraint, indicating that the condition to be expressed as
part of the constraint must be true for all Crew members c_i and all flights
f_i assigned to the crew. The scope of variables in the ForAll section can be
any object defined in the import or local schema. The notation AirFun::Crew
denotes that Crew is a class in the AirFun database.

PbG CONSTRAINT AirFun_valid_flight_hours
ForAll:
    c in AirFun::Crew, f in c.flights_assigned
Define:
    AirFun_hours := c.flight_hours(f.date),
    LittleAir_hours := element SELECT c2.flight_hours(f.date)
                               FROM Crew c2
                               WHERE c.ssn = c2.ssn
                               USING LittleAir
Condition:
    sum(AirFun_hours, LittleAir_hours) <= 8
end

Figure 3 Safety constraint example

The Define section allows the declaration of variables that represent values
that result from the evaluation of local or remote queries. Such variables can
be used in the specification of the constraint Condition. In the above example,
AirFun_hours is the number of hours that c_i has in the local database on the
date of flight f_i. Similarly, LittleAir_hours is the number of hours that c_i has
in the remote database on the same date. The condition expresses that the
sum of the hours should not exceed the value of 8.
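
To make the roles of the Define and Condition sections concrete, the following Python sketch evaluates the safety constraint for one crew member and one date. It is an illustration under stated assumptions, not MCSL itself: the two lookup functions stand in for the local and remote queries, and the dictionaries, social security numbers and dates are invented sample data.

```python
from typing import Callable

MAX_HOURS_PER_DAY = 8  # limit stated in the Condition section of the constraint


def flight_hours_ok(ssn: str, date: str,
                    local_hours: Callable[[str, str], float],
                    remote_hours: Callable[[str, str], float]) -> bool:
    """Return True when the Condition of the safety constraint holds."""
    airfun_hours = local_hours(ssn, date)       # Define: AirFun_hours (local query)
    littleair_hours = remote_hours(ssn, date)   # Define: LittleAir_hours (remote query)
    return airfun_hours + littleair_hours <= MAX_HOURS_PER_DAY


# Hypothetical in-memory stand-ins for the AirFun and LittleAir databases.
airfun = {("111", "1998-11-19"): 5.0}
littleair = {("111", "1998-11-19"): 4.0, ("222", "1998-11-19"): 2.0}


def airfun_lookup(ssn: str, date: str) -> float:
    """Hours flown for AirFun on the given date (0 when nothing is recorded)."""
    return airfun.get((ssn, date), 0.0)


def littleair_lookup(ssn: str, date: str) -> float:
    """Hours flown for LittleAir on the given date (0 when nothing is recorded)."""
    return littleair.get((ssn, date), 0.0)


over_limit = flight_hours_ok("111", "1998-11-19", airfun_lookup, littleair_lookup)
within_limit = flight_hours_ok("222", "1998-11-19", airfun_lookup, littleair_lookup)
```

In this sample data, crew member 111 has 5 local and 4 remote hours on the same day, so the condition fails, while crew member 222 stays within the limit.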

To enforce MCSL constraints, we use distributed active rules. The follow- 
ing sections describe how distributed rules are used to check and maintain 
constraints. 



4 DISTRIBUTED ACTIVE RULE DEFINITION 

Given the basic assumptions about the multidatabase system and the con- 
straint language, this section presents the details of the distributed active 
rule language. Section 4.1 describes the basic structure and general
semantics of distributed rules. The use of distributed active rules is illustrated with
examples in Section 4.2. 



4.1 Rule structure 

Existing rule languages are defined in the context of centralized database 
systems. An important aspect to consider in a distributed environment is the 
identification of the local and the remote components needed to evaluate the 
rule in an efficient manner. For example, in some cases it may be possible 
to validate global constraints by checking local data only (Gupta 1993). In
other cases, it may be required to examine data at remote sites. Under some
circumstances, both local and remote conditions must be examined. Given
these considerations, the basic structure of a distributed active rule is shown 
in Figure 4. 

Rule <rule_name>
Event: E_1 or E_2 or ... or E_n
[Condition:
    [exists {O_i in <class_variable> | <class_variable>.<attribute>} where]
    <local_condition_name>(p_1, ..., p_n) = true OR
    (<local_condition_name>(p_1, ..., p_n) = unknown AND
     <remote_condition_name>(p_1, ..., p_n) = true)
],
Action: <action_list>
Priority:
    [before <rule_list>]
    [after <rule_list>]
End_Rule

Figure 4 Structure of the distributed active rule 

In the event specification, each E_i has the form before | after
<event_name>. The before and after options are used to indicate when the condition
and action of the rule are executed with respect to the event. Specifically,
the before directive indicates that the rule condition and action are evaluated 
before the execution of the event. Similarly, the after directive indicates that 
the condition and action are evaluated after the execution of the event. A 
rule can be triggered by events that occur at local or remote sites. Events at 
remote sites, however, can only be used together with the after directive. 

In general, any method can be used in the event specification. However, 
the events of interest for constraint maintenance are only the low-level
operations that change the state of the database. In the object-oriented model
used as a framework, the low-level operations that can alter the state of the
database are: 1) New_<object_name> and Delete_<object_name>, used to
create and delete instances, 2) Modify_<attr_name> and Modify_<sv_rel_name>,
used to change values of attributes or single-valued relationships, and 3)
Insert_<mv_rel_name> and Delete_<mv_rel_name>, used to create or delete
multi-valued relationships.

As in traditional centralized environments, the condition is a query that
determines if the constraint is violated. A condition evaluation that returns a
value of true indicates that there is a violation of the constraint. An action
can then be executed to restore the consistency of the data. Since it may be
necessary to evaluate the rule condition in the local database and in remote
locations, the condition is composed of a local part and a remote part. The 
local condition is evaluated first and if it returns a value of true, there is no 
need to check the remote component. In this case, the action can be triggered.
If the local condition returns a value of false, the constraint is satisfied and
the action is not triggered. For distributed constraints, however, there may
not be enough information in the local database to validate a constraint. If
the local condition evaluation returns a value of unknown, then the remote
condition must be tested. Testing the condition of a distributed active rule is
complicated by the fact that the condition must be evaluated for all objects
affected by the event. As a result, for one instance of a triggered rule, some
objects affected by the constraint may require the checking of local conditions 
only while others may require the checking of local and remote conditions. A 
more detailed description of the rule execution semantics occurs in Section 5. 
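
The local-then-remote evaluation order described above can be sketched as a small three-valued check in Python. The names `Tri`, `is_violated` and `rule_fires`, as well as the sample objects, are assumptions introduced for this illustration; the sketch only shows that the remote condition is consulted for exactly those objects whose local result is unknown.

```python
from enum import Enum
from typing import Callable, Iterable, List


class Tri(Enum):
    """Possible results of a local condition."""
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


def is_violated(obj: str,
                local_cond: Callable[[str], Tri],
                remote_cond: Callable[[str], bool]) -> bool:
    """Evaluate the local part first; go remote only when it is unknown."""
    local = local_cond(obj)
    if local is Tri.TRUE:
        return True            # violation detected locally
    if local is Tri.FALSE:
        return False           # satisfied locally, no remote access needed
    return remote_cond(obj)    # unknown: the remote condition decides


def rule_fires(objects: Iterable[str],
               local_cond: Callable[[str], Tri],
               remote_cond: Callable[[str], bool]) -> bool:
    """The action is triggered if the condition is true for any affected object."""
    return any(is_violated(o, local_cond, remote_cond) for o in objects)


# Hypothetical outcomes for three objects affected by one event.
LOCAL_RESULT = {"a": Tri.FALSE, "b": Tri.UNKNOWN, "c": Tri.TRUE}
remote_calls: List[str] = []


def local_cond(obj: str) -> Tri:
    return LOCAL_RESULT[obj]


def remote_cond(obj: str) -> bool:
    remote_calls.append(obj)   # record which objects required remote access
    return False               # the remote site reports no violation


fires = rule_fires(["a", "b", "c"], local_cond, remote_cond)
```

Here only object "b" causes a remote call, and the rule still fires because object "c" is found to be in violation locally.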

Another option in the rule specification is the omission of the condition 
clause. These types of rules are known as event-action rules, in which the 
condition is assumed to be true and the action is always executed when the
event is triggered. 

For constraint maintenance, there are two types of actions that can be
executed: abortive and corrective actions. An abortive action can be executed
in the local database when a local event introduces a violation of a public
global constraint. Corrective actions are executed when a remote event causes
a violation of a private global constraint in the local database that owns
the private constraint. In this case, the local database does not have the
authority to alter the state of the remote transaction that generated the event. 
Corrective actions can be executed, however, in the local database to restore 
the consistency of the data. 

Finally, a rule priority can be defined using the before and after clauses
in the rule definition. The rules specified in the before clause of rule R_i are
executed before R_i and the rules specified in the after clause are executed
after R_i. Note that we also assume immediate coupling modes between the
event and the condition and between the condition and the action. We have 
not yet addressed the issues of deferred coupling modes for distributed rules. 
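
One way to derive an execution order from such before and after clauses is a topological sort. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the dictionary layout and the function `execution_order` are assumptions made for the example, and the paper does not prescribe this technique.

```python
from graphlib import TopologicalSorter
from typing import Dict, List


def execution_order(priorities: Dict[str, Dict[str, List[str]]]) -> List[str]:
    """Order rules so that, for each rule R, the rules in R's `before` clause
    run before R and the rules in R's `after` clause run after R."""
    sorter = TopologicalSorter()
    for rule, clauses in priorities.items():
        # Rules listed in the before clause are predecessors of `rule`.
        sorter.add(rule, *clauses.get("before", []))
        # `rule` is a predecessor of every rule listed in its after clause.
        for later in clauses.get("after", []):
            sorter.add(later, rule)
    return list(sorter.static_order())


# Hypothetical priorities: R1 runs before R2, and R3 runs after R2.
order = execution_order({"R2": {"before": ["R1"], "after": ["R3"]}})
```

If the clauses contradict each other (for example, two rules each listing the other in a before clause), `static_order` raises `graphlib.CycleError`, which gives a simple consistency check on the declared priorities.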



4.2 Rule Examples 

To illustrate the distributed active rule language introduced in the previous 
section, this section presents several rule examples that refer to the airline 
application of Section 3.1. Consider again the private global constraint from
Section 3 that restricts the planes that can be assigned to a pilot. The
specification of this constraint is shown in Figure 5.

During the rule specification process we need to identify the operations that
can affect the constraint and the entities involved. For example, this constraint
involves the entities LittleAir::Crew, LittleAir::Flight, LittleAir::Plane,
ControlAir::Crew and ControlAir::Plane. Identifying the possible updates that can
affect this constraint is not trivial because the information about the type of
plane a pilot can fly is stored in the database ControlAir and the assignment
of flights is stored in the LittleAir database. Furthermore, there is no direct
relationship between a crew member and the type of plane he/she is assigned
to fly in any of the airline databases. The plane type must be examined by
traversing through the flights assigned to each crew member and the plane
assigned to each flight.

PvG CONSTRAINT LittleAir_Crew_Can_Fly
Description: Pilots can fly only in planes for which they completed
             required training
ForAll:
    c in LittleAir::Crew, f in c.flights_assigned
Define:
    plane_type_assigned := f.plane_assigned.type
Condition:
    IF c.title = ‘PILOT’ and plane_type_assigned <> NULL THEN
        Exists SELECT p2.type
               FROM Crew c2, c2.canfly p2
               WHERE c.ssn = c2.ssn and p2.type = plane_type_assigned
               USING ControlAir
end

Figure 5 Can-fly constraint example

Assigning a flight to a crew member, changing the plane assigned to a flight, 
modifying the type of a plane or removing a canfly relationship in ControlAir 
are a few examples of operations that can affect this constraint. For each
operation, an active rule can be defined. The following examples illustrate
two of those rules. 

Example 1 The active rule in this example is used to verify the consistency 
of the database when a plane is assigned to a flight:



Rule Crew_Can_Fly
Event: before LittleAir::Flight.modify_plane_assigned(Flight, Plane) |
       before LittleAir::Plane.insert_flights_assigned(Plane, Flight)
Condition:
    exists C in Flight.crew_assigned where
    Local_Crew_canFly_is_invalid(C, Flight) = true OR
    (Local_Crew_canFly_is_invalid(C, Flight) = unknown AND
     Remote_Crew_canFly_is_invalid(C, Flight) = true)
Action: Abort
End_Rule

The local and remote conditions are evaluated for all crew members c_i ∈
Flight.crew_assigned. If the condition is true for any c_i, the transaction is
aborted. The details of the local and remote conditions are:
Rule_condition Local_Crew_canFly_is_invalid(c, f)
Subcond 1:
    if c.title = ‘PILOT’ and f.plane_assigned <> NULL then
        Execute Next condition (Subcond 2)
    else return(false)
end
Subcond 2:
    plane_type_assigned := f.plane_assigned.type
    OQ_Crew_canFly(c, plane_type_assigned, found)
    if found = true then return(false)
    else return(unknown)
end
EndRule

Rule_condition Remote_Crew_canFly_is_invalid(c, f)
Subcond 1:
    RQ_Crew_canFly(c.ssn, plane_type_assigned, found)
    if found = true then return(false)
    else return(true)
end
EndRule

Using the constraint specification, local and remote tests are developed.
The local condition is expressed in terms of local objects only. The remote
condition is defined in terms of the results obtained in the local condition
combined with calls to remote databases. For each combination (c_i, Flight),
the local condition is evaluated first. In the local rule condition, notice that
there are two subconditions. The first subcondition evaluates the local
predicates identified in the constraint specification. The second subcondition
contains OQ-calls that optimize the remote condition checking process with
the introduction of additional tests that can avoid the need to evaluate the
remote condition. The ‘OQ’ or Optimizer Query identifies the method call
as a locally optimized test. In the example above, OQ_Crew_canFly
checks if c_i has another flight in the local database with the same airplane
type. If another flight is found, we can conclude that the plane type is valid and
there is no need to test the remote condition. However, if no flight is found, the
status of constraint satisfiability is unknown and we proceed to test the remote
condition. The remote condition calls the remote query RQ_Crew_canFly to
find in ControlAir if the combination c_i and plane_type_assigned is valid.
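
The interplay of the two subconditions and the remote query can be traced with a short Python sketch. It mirrors the structure of the rule conditions above but is not the authors' implementation: the sets `LOCAL_VALIDATED` and `REMOTE_CANFLY` are invented sample data standing in for the LittleAir and ControlAir databases, and `None` is used here to represent the value unknown.

```python
from typing import Optional

# Hypothetical data. LOCAL_VALIDATED: (ssn, plane type) pairs for which the
# crew member already has another flight in the local LittleAir database.
LOCAL_VALIDATED = {("111", "B737")}
# REMOTE_CANFLY: can-fly links registered in the remote ControlAir database.
REMOTE_CANFLY = {("111", "B737"), ("111", "B777")}


def oq_crew_can_fly(ssn: str, plane_type: str) -> bool:
    """Optimizer query: a purely local test that may make remote access unnecessary."""
    return (ssn, plane_type) in LOCAL_VALIDATED


def rq_crew_can_fly(ssn: str, plane_type: str) -> bool:
    """Remote query: look up the can-fly link in ControlAir."""
    return (ssn, plane_type) in REMOTE_CANFLY


def local_is_invalid(title: str, ssn: str,
                     plane_type: Optional[str]) -> Optional[bool]:
    """Local condition; returns None when the result is unknown."""
    # Subcond 1: the constraint only applies to pilots with a plane assigned.
    if title != "PILOT" or plane_type is None:
        return False
    # Subcond 2: a local match proves validity; otherwise we cannot tell.
    if oq_crew_can_fly(ssn, plane_type):
        return False
    return None


def remote_is_invalid(ssn: str, plane_type: str) -> bool:
    """Remote condition: invalid exactly when ControlAir has no can-fly link."""
    return not rq_crew_can_fly(ssn, plane_type)


def can_fly_violated(title: str, ssn: str, plane_type: Optional[str]) -> bool:
    """Full rule condition: local result first, remote only when unknown."""
    local = local_is_invalid(title, ssn, plane_type)
    if local is None:
        return remote_is_invalid(ssn, plane_type)
    return local
```

With this sample data, assigning a ‘B737’ to pilot 111 is settled locally, a ‘B777’ requires the remote query and is found valid, and an ‘A320’ is reported as a violation by the remote condition.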

Example 2 In this example, we illustrate that two different options for
event-action rules can be used to restore the consistency of the data when a license
to fly a plane is revoked for a pilot in the ControlAir database. Notice that in
this example, LittleAir receives a remote event originated in ControlAir. Let
(‘John’, ‘B777’) be the link removed in ControlAir. To eliminate the violation
there are two options:



1. To remove all the flights with plane ‘B777’ that ‘John’ has in LittleAir.

Rule Crew_Cannot_Fly_1

Event: after ControlAir::Crew.delete_can_fly(C, Plane_type)
Action: LittleAir::Crew.Delete_flights_assigned(C.ssn, Plane_type)

End_Rule

The action Delete_flights_assigned is implemented as the following update:

Delete relationship flights_assigned (crew_assigned)
from crew c2, c2.flights_assigned f
where c2.ssn=‘John_ssn’ and f.plane_assigned.type=‘B777’
using LittleAir

2. To remove the plane ‘B777’ from all the flights that ‘John’ has in LittleAir.

Rule Crew_Cannot_Fly_2

Event: after ControlAir::Crew.delete_can_fly(C, Plane_type)
Action: LittleAir::Flight.Delete_plane_assigned(C.ssn, Plane_type)

End_Rule

The action Delete_plane_assigned corresponds to the following update:

Delete relationship plane_assigned (flights_assigned)
from crew c2, c2.flights_assigned f
where c2.ssn=‘John_ssn’ and f.plane_assigned.type=‘B777’
using LittleAir
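
The difference between the two corrective actions can be illustrated with a small Python sketch. The in-memory FLIGHTS dictionary and the handler name on_remote_delete_can_fly are assumptions made for illustration; they are not part of the rule language or of the prototype.

```python
import copy

# Hypothetical LittleAir state: crew set and assigned plane type per flight.
FLIGHTS = {
    "f1": {"crew": {"john", "ann"}, "plane_type": "B777"},
    "f2": {"crew": {"john"}, "plane_type": "A320"},
}

def delete_flights_assigned(flights, crew, plane_type):
    """Option 1: unassign the crew member from every flight using plane_type."""
    for flight in flights.values():
        if crew in flight["crew"] and flight["plane_type"] == plane_type:
            flight["crew"].discard(crew)

def delete_plane_assigned(flights, crew, plane_type):
    """Option 2: unassign the plane from every such flight, keeping the crew."""
    for flight in flights.values():
        if crew in flight["crew"] and flight["plane_type"] == plane_type:
            flight["plane_type"] = None

def on_remote_delete_can_fly(flights, crew, plane_type, action):
    """Handler run after the remote event; applies the chosen local action."""
    action(flights, crew, plane_type)
    return flights

# Each option is applied to its own copy so the two outcomes can be compared.
option1 = on_remote_delete_can_fly(
    copy.deepcopy(FLIGHTS), "john", "B777", delete_flights_assigned)
option2 = on_remote_delete_can_fly(
    copy.deepcopy(FLIGHTS), "john", "B777", delete_plane_assigned)
```

Both options remove the violating combination; they differ in whether the flight keeps its plane (option 1) or its pilot (option 2).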



5 RULE EXECUTION SEMANTICS 

Given the presentation of distributed active rules in the previous section, 
this section describes the execution semantics for such rules. The distributed
active database architecture that supports the execution of the distributed 
active rules is presented in Section 5.1. Section 5.2 describes the algorithm for 
processing distributed active rules. 



5.1 Distributed Active Database Architecture 

The architecture of the active component of the multidatabase environment 
is illustrated in Figure 6. We assume that each component can respond to 
read-only requests from remote components. In addition, local components 
can receive an event notification from a remote component and start the
execution of local transactions. Since the system does not support distributed
transactions, local components are not allowed to send update requests to
remote components. The database components can be passive databases. A
layered approach is used to support the active functionality required for the 
detection of the events and the processing of the rules (Widom et al. 1996). 




Figure 6 Architecture of the active multidatabase system 

The rules used in the distributed active rule system are stored in the data 
repositories. The Constraint Catalog contains the specification of the con- 
straint in MCSL. This specification is used to report violations to the user. 
The Local Rule Repository contains global and local rules used to maintain 
consistency of the local database. The Global Rule Repository contains infor- 
mation about the remote queries and the corresponding global rules. 

As shown in Figure 6, every component database is assumed to have an 
Update Processor that executes update requests in the local database. If an 
update is specified as an event of an active rule, the rule is executed. If the local
database is a traditional passive database, then the update processor must be
extended to signal the local event processor when a change has occurred in
the database. The Query Processor executes read-only requests in the local
database. Since these requests do not change the state of the database, the
query processor does not trigger any integrity constraint rule. However, the
query processor participates in the rule execution process by evaluating local
conditions.

The Local Rule Manager controls the execution of local and distributed 
active rules. When an event is detected, the rule manager triggers the rules 
associated with the event and calls the Local Condition Evaluator to test
conditions in the local database. If there is not enough information to verify 
the constraint locally, the rule manager invokes the remote rule processing 
mechanism. 

The remote rule processing mechanism provides the communication inter- 
face between databases and controls the execution of global rules. The Global 
Rule Processor executes the remote calls needed to check constraint conditions 
from a remote database. The Remote Condition Evaluator is invoked by the
global rule processor to evaluate read-only queries in a remote database. In a
distributed environment, the same rule may be triggered by events in different
databases. Therefore, it is necessary to provide an execution model that
supports both concurrent and sequential rule execution. The Global Rule
Processor allows concurrent execution of rules but also serializes rule execution
when conflicts in the concurrent access of data are detected. Concurrency
control and recovery for transactions operating in the local database are provided
by the Local Transaction Manager.

The Remote Event Processor is invoked when an event is executed at a
remote database. When the remote event occurs, the database in which it
occurs must notify the Remote Event Detector of the remote rule processing
mechanism. The remote event detector then signals the remote event processor
at the local databases that are interested in the occurrence of the event.
When the signal is received, an active rule is triggered. Typically, the actions 
executed within the rule are local corrective operations to satisfy a constraint 
that was violated by the remote database. 

Interoperability of all database components in the environment is achieved
through the use of CORBA technology (CORBA 1993). In particular, we have
used ORBeline (ORBeline 1994) to develop the prototype for this research
(Healy 1997). The use of the CORBA distributed object framework provides
several advantages. First, CORBA provides the mechanisms by which objects
transparently make requests and receive responses by simply invoking method
calls. Second, the low-level communication is hidden by CORBA and the
modules are written independently of the communication. Finally, CORBA
objects are not attached to a specific location. Therefore, it is easy to
redistribute modules as the system matures.



5.2 Rule Processing 

In this section we present the algorithm for processing distributed active
rules. To formally describe the rule processing algorithm, we first introduce
several concepts and definitions.

Definition 1 Let $K_i$ be a constraint. The Constraint Class Set of $K_i$, denoted
$CCS(K_i) = \{C_1, C_2, \ldots, C_n\}$, is the set of all class names on which the
constraint $K_i$ has an effect.

Definition 2 Let $E_j$ be an event in the database. The Potentially Violated
Set of constraint $K_i$ by the event $E_j$, denoted $PVS(K_i, E_j)$, is the set of
object instances $\{P_1, P_2, \ldots, P_n\}$ that can violate the constraint $K_i$ when $E_j$
is executed. Each $P_i$ is a tuple of the form $[O_1, O_2, \ldots, O_n]$, where each $O_i$ is a
member of a class in $CCS(K_i)$.

Definition 3 The before-event rule set $BR(E_j) = \{br_1, br_2, \ldots, br_n\}$ is the set
of rules with the before directive in the specification of event $E_j$. Likewise,
the after-event rule set $AR(E_j) = \{ar_1, ar_2, \ldots, ar_n\}$ is the set of rules with
the after directive in the specification of event $E_j$.




Figure 7 Transaction processing



The control flow for transaction processing at each local site is shown in
Figure 7. When the event $E_j$ is detected, the rule process starts for the
before-event rules $\{br_1, br_2, \ldots, br_n\}$. If an abort directive occurs with such a
rule, the transaction is aborted immediately. If no rule invokes an abort,
the event is executed. Finally, the set of after-event rules $\{ar_1, ar_2, \ldots, ar_n\}$ is
processed in the same manner.
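
The control flow just described can be sketched in Python. The sketch assumes that rules and events are plain callables and that an abort is signalled by a hypothetical Abort exception. It does not model rollback of work already done, which in the architecture above is the responsibility of the Local Transaction Manager.

```python
class Abort(Exception):
    """Raised by a rule whose action is the abort directive (hypothetical)."""

def process_transaction(event, before_rules, after_rules, log):
    """Run before-event rules, then the event, then after-event rules."""
    try:
        for rule in before_rules:
            rule(log)
        event(log)            # reached only if no before-event rule aborted
        for rule in after_rules:
            rule(log)
    except Abort:
        log.append("abort")   # the transaction stops immediately
        return False
    log.append("commit")
    return True

def failing_rule(log):
    raise Abort()

ok_log, bad_log = [], []
ok = process_transaction(
    lambda log: log.append("event"),
    [lambda log: log.append("before")],
    [lambda log: log.append("after")],
    ok_log,
)
bad = process_transaction(lambda log: log.append("event"), [failing_rule], [], bad_log)
```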

Each rule set is processed sequentially before and after the event using
Algorithm 1. In each rule set, rules are ordered based on priority. During the
rule evaluation process, there are two queues that contain the object instances
that are being tested for constraint violation. The local condition queue
contains the object instances that need to be evaluated in the local database.
Similarly, the remote condition queue contains the object instances $P_i$ that
need to be evaluated in the remote database.

Algorithm 1. Let $E_j$ be the update operation that triggered the rule. The
processing of the before-event and after-event rule sets is as follows:

While rule set not empty {

  1. Select a rule $R_i$ from the rule set

  2. Identify the potentially violated set $PVS(K_i, E_j)$ and append each
     $P_i \in PVS(K_i, E_j)$ to the local condition queue.

  3. While local condition queue not empty

     3.1 Get tuple $[O_1, O_2, \ldots, O_n]$ from local-condition-queue

     3.2 result = evaluate-local-condition(condition-name, $O_1, O_2, \ldots, O_n$)

     3.3 if result = unknown then
           enqueue(remote-condition-queue, $[O_1, O_2, \ldots, O_n]$)

     3.4 if result = true then
           execute action

  4. While remote condition queue not empty

     4.1 Get tuple $[O_1, O_2, \ldots, O_n]$ from remote-condition-queue

     4.2 result = evaluate-remote-condition(condition-name, $O_1, O_2, \ldots, O_n$)

     4.3 if result = true then
           execute action

}



The condition evaluation of each $P_i$ involves local and remote condition
checking. A result of unknown in the evaluation of the local condition for tuple
$P_i$ indicates that there is not enough information in the local database to test
the constraint. Therefore, the condition checking is not complete and $P_i$
is appended to the remote condition queue for further processing. If the result
of the condition is false, there is no constraint violation and there is no need
to check the condition for $P_i$ in the remote database. However, if the result of
the condition is true, the constraint is violated and the rule action is executed
immediately. If the rule action is an abort, the rule processing algorithm
terminates. If the rule action is a corrective action, then the corrective action
is invoked as a subtransaction to the triggering event for each object that
violates the constraint. After the local condition is tested for all $P_i$'s in the
potentially violated set, the remote condition is evaluated for the $P_i$'s in the
remote condition queue.
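
A minimal Python sketch of Algorithm 1 follows. It rests on several assumptions that are not in the text: a rule is represented as a dictionary with the hypothetical keys priority, pvs, local, remote and action; a smaller priority number is taken to run first; and the abort case is left out.

```python
from collections import deque

TRUE, FALSE, UNKNOWN = "true", "false", "unknown"

def process_rule_set(rule_set, event):
    """Run Algorithm 1 over rule_set; return the (stage, tuple) actions fired."""
    fired = []
    # Step 1: select rules by priority (assumed: smaller number runs first).
    for rule in sorted(rule_set, key=lambda r: r["priority"]):
        # Step 2: load the potentially violated set into the local queue.
        local_queue = deque(rule["pvs"](event))
        remote_queue = deque()
        # Step 3: evaluate the local condition for every tuple.
        while local_queue:
            objects = local_queue.popleft()
            result = rule["local"](*objects)
            if result == UNKNOWN:
                remote_queue.append(objects)     # step 3.3
            elif result == TRUE:
                rule["action"](*objects)         # step 3.4
                fired.append(("local", objects))
        # Step 4: evaluate the remote condition for the undecided tuples.
        while remote_queue:
            objects = remote_queue.popleft()
            if rule["remote"](*objects) == TRUE:
                rule["action"](*objects)         # step 4.3
                fired.append(("remote", objects))
    return fired

# Toy rule: "a" is fine locally, "b" violates locally, "c" and "d" need the
# remote check, and only "c" turns out to violate the constraint.
corrected = []
toy_rule = {
    "priority": 1,
    "pvs": lambda event: [("a", event), ("b", event), ("c", event), ("d", event)],
    "local": lambda obj, ev: {"a": FALSE, "b": TRUE}.get(obj, UNKNOWN),
    "remote": lambda obj, ev: TRUE if obj == "c" else FALSE,
    "action": lambda obj, ev: corrected.append(obj),
}
fired = process_rule_set([toy_rule], "e1")
```

Only the tuples whose local result is unknown ever reach the remote queue, so the number of remote calls is bounded by the number of locally undecided tuples.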

Example 3 Assume that John, Mike and Peter are pilots. Eagle and EarlyBird
are planes of type ‘A320’. Sky and Sunrise are planes of type ‘B777’. The
information about the types of planes that the pilots can fly as well as the flight
assignments is shown in Tables 1 and 2:

Crew     canfly
John     ‘B777’, ‘A320’, ‘A340’
Mike     ‘B777’, ‘A320’
Peter    ‘A320’, ‘A340’

Table 1 Crew Information

Flight   crew-assigned      plane-assigned
f100     John, Mike, Ann
f101     John, Mike         Eagle
f102     Peter, Ann         Eagle
f103     John, Ann          Sky

Table 2 Flight Information

For simplicity we use the crew name instead of ssn to identify the crew members.
Also, we show the values of the attributes directly instead of object
identifiers. Notice how the algorithm is executed when the following events
trigger the distributed active rules:

1. Assign plane EarlyBird to flight f100.

The rule crew_can_fly is triggered and the potentially violated set is identified.
When the local condition is checked, all elements satisfy the local
condition. John and Mike have another flight with type ‘A320’. Since EarlyBird
is also of type ‘A320’, it is assumed that the constraint is not violated.
Ann is not a pilot and therefore satisfies the constraint. As a result, there
is no need to check the remote constraint.

PVS= { [John, f100], [Mike, f100], [Ann, f100] }

Local Condition Queue= { [John, f100], [Mike, f100], [Ann, f100] }

Remote Condition Queue= { null }

2. Assign plane Sunrise to flight f100.

The rule crew_can_fly is triggered and the potentially violated set is identified.
When the local condition is checked, John and Ann satisfy the local
condition because John has another flight with type ‘B777’ and Ann is not
a pilot. The remote constraint is only verified for Mike and no violation of
the constraint occurs.

PVS= { [John, f100], [Mike, f100], [Ann, f100] }

Local Condition Queue= { [John, f100], [Mike, f100], [Ann, f100] }

Remote Condition Queue= { [Mike, f100] }

3. Revoke license for plane type ‘B777’ to John.

Assume that we are using the Crew_Cannot_Fly_1 rule presented in
Example 4.2. The action will eliminate relationship crew_assigned {John} for
flights {f100, f103}.
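
The queue contents reported in this example can be reproduced with a short Python sketch. The encoding of Tables 1 and 2 as dictionaries and the helper names assign_plane and revoke_license are assumptions made for illustration. The three events are applied in sequence, so flight f100 carries Sunrise when the license is revoked.

```python
PLANE_TYPE = {"Eagle": "A320", "EarlyBird": "A320", "Sky": "B777", "Sunrise": "B777"}
PILOTS = {"John", "Mike", "Peter"}
# Table 1, held remotely (ControlAir).
CAN_FLY = {
    "John": {"B777", "A320", "A340"},
    "Mike": {"B777", "A320"},
    "Peter": {"A320", "A340"},
}

def initial_flights():
    """Table 2, held locally (LittleAir)."""
    return {
        "f100": {"crew": ["John", "Mike", "Ann"], "plane": None},
        "f101": {"crew": ["John", "Mike"], "plane": "Eagle"},
        "f102": {"crew": ["Peter", "Ann"], "plane": "Eagle"},
        "f103": {"crew": ["John", "Ann"], "plane": "Sky"},
    }

def assign_plane(flights, flight, plane):
    """Apply the event; return (remote condition queue, violations found)."""
    flights[flight]["plane"] = plane
    ptype = PLANE_TYPE[plane]
    remote_queue, violations = [], []
    for crew in flights[flight]["crew"]:      # PVS: one [crew, flight] per member
        if crew not in PILOTS:
            continue                          # local condition is false
        has_other = any(
            fid != flight and crew in f["crew"]
            and f["plane"] is not None and PLANE_TYPE[f["plane"]] == ptype
            for fid, f in flights.items()
        )
        if not has_other:
            remote_queue.append((crew, flight))   # local condition is unknown
    for crew, fid in remote_queue:            # remote check against Table 1
        if ptype not in CAN_FLY[crew]:
            violations.append((crew, fid))
    return remote_queue, violations

def revoke_license(flights, crew, ptype):
    """Corrective action of rule Crew_Cannot_Fly_1; returns affected flights."""
    affected = []
    for fid, f in flights.items():
        if crew in f["crew"] and f["plane"] is not None and PLANE_TYPE[f["plane"]] == ptype:
            f["crew"].remove(crew)
            affected.append(fid)
    return affected

flights = initial_flights()
event1 = assign_plane(flights, "f100", "EarlyBird")
event2 = assign_plane(flights, "f100", "Sunrise")
event3 = revoke_license(flights, "John", "B777")
```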

6 SUMMARY AND FUTURE WORK 

This paper has presented the concept of distributed active rules for the checking
and maintenance of constraints in a multidatabase environment. Distributed
rules extend the traditional notion of event-condition-action rules
with the specification of rule conditions that involve local and remote com- 
ponents. We presented an execution model for distributed active rules as well 
as a distributed rule processing architecture for the execution of such rules. 
Distributed active rules support the use of complex constraints between het- 
erogeneous database systems by decoupling multidatabase constraint enforce- 
ment from application code and providing a more general mechanism for the 
enforcement of complex constraints between distributed data sources. 

There are several directions for future research that we are currently investi- 
gating. As illustrated in this paper, the rule definition process is not a trivial 
task. We are investigating techniques for the automated analysis of MCSL 
constraints to assist in the generation of the distributed rules. An important 
aspect of this work is to develop techniques for optimizing rule conditions 
so that the need for remote condition testing is minimized by making use of
local data whenever possible. Another important aspect of this work involves
a thorough analysis of distributed constraints to identify the operations that 
can violate a constraint and the databases in which those operations can oc- 
cur. Finally, although we have experimented with the implementation of the 
concepts presented in this paper, a full implementation of a distributed active 
rule processing environment is still under development so that we can better 
analyze architectural and execution issues of distributed active rules.

Abstract

A declarative mediator language, based upon operations among logic theo- 
ries is introduced. In particular we concentrate on the constraint operator. 
The denotational semantics of the language is introduced together with the 
definition of a bottom-up efficient implementation. The use of the constraint 
operator for security within a mediator architecture for database integration 
is suggested and presented by means of a simple example.

1 INTRODUCTION

The Internet and the World Wide Web capability are showing the need for 
organizations to access and integrate different sources of information. Future 
applications are likely to be built by putting together systems developed and 
managed at different sites. Integration, federation, cooperation etc of infor- 
mation sources or software systems, in general, seem to be a must. Security 
and privacy are thus becoming more crucial (Jajodia 1996a), (Jajodia 1996b), 
both for relational and advanced database systems. 

*Work partially funded by Progetti Coordinati CNR-Comitato 12: “KINE-Knowledge
Integration Environment” and “Programmazione Logica (Logic Programming)” and EADTN
Project: ERCIM Advanced Database Technology Network, EEC contract n. CHRX-CT94-0531.








Given the state of the art, providing for semantic heterogeneity and for
“secure” integration of information sources/databases is crucial for the
development of future distributed/integrated applications.

Semantic heterogeneity and integration of databases is a hot area of re- 
search. Various architectures can be found in the literature, as it is in (Hull 
1997), where a mediator (notion due to Wiederhold (Wiederhold 1992)) turns 
out to be a very interesting and promising one. 

Mediation supports semantic integration of databases providing for read- 
only view of information sources that reside on different sites and, in some 
cases, update capabilities. 

Interesting proposals for mediating database systems can be found in the lit- 
erature, both for relational and logic/deductive databases (Papakostantinou, 
Garcia-Molina and Ullman 1996, Subrahmanian 1994, Lu et al 1995, Aquilino 
et al. 1995, Asirelli, Renso and Turini 1996).

With respect to security of systems, many models, policies and enforcing 
mechanisms can be found in the literature, as it is summarized in (Bertino, 
Samarati and Jajodia 1997) and (Jajodia et al. 1997). Security, within the 
framework of deductive databases and their integration has also received great 
attention (Bertino, Jajodia and Samarati 1995, Bonatti, Kraus and Subrah- 
manian 1995, Candan, Jajodia and Subrahmanian 1996). In relation to a 
mediator approach to security we also mention the TIHI project (Wiederhold 
et al 1996a, Wiederhold et al 1996b). 

In this paper we consider an approach to build a federation of information 
sources via mediators. The language we refer to is MedLan (Aquilino et al.
1995, Asirelli, Renso and Turini 1996, Aquilino et al. 1997), an extended
logic language for deductive databases where the basic extensions are the
partitioning of the deductive database into a collection of theories, operators
to combine them, and the “in” feature, which is a sort of “message passing”
feature. MedLan has been given an operational and a denotational semantics.

In this paper we particularly concentrate on the definition of a kind of 
seminaive implementation of the MedLan Language, where the most inter- 
esting aspect is the constraint (/) operator. This operator allows for different 
applications (Renso 1998, Aquilino et al. 1997, Asirelli et al. 1998) that we
have studied and developed. Here we present, as an example of the use of the 
constraint operator of MedLan, an approach to database security within a me- 
diator architecture for database integration. For the moment we have taken 
into consideration a multilevel security model, as the Bell-La Padula model, 
where the data and the users are classified into various classes (or levels) and 
then the appropriate security policy of the organization is implemented. 

The general idea is to use the language MedLan to build a set of logical 
theories that constitute a middle layer of an architecture to build an application
which uses a federation of databases as the source of information. In
other words, MedLan can be used to implement the layer of an application,
in an integrated environment, that stands between the database sources and
the final user that will be allowed to see and use only part of the complete
set of information. In this way we are able to address two unusual aspects,
at the same time: the semantic integration between database sources and the 
implementation of security policies, where the “constraint” operator plays the 
major role. 

Section 2 introduces the syntax of the language, an intuitive explanation of 
composition operators, and an abstract semantics. In section 3 we discuss an 
implementation of MedLan based on an extension of the semi-naive computa-
tion rule for deductive data bases. The complete proof is not included here for 
shortage of space, but can be found in (Renso 1998). In section 4, an example 
of its application to security is given. Finally, in section ?? we conclude. 

2 THE MEDLAN LANGUAGE 

We consider a set of meta-level operations for composing definite logic programs,
originally introduced in (Brogi 1993, Brogi et al. 1994, Aquilino et
al. 1995): Union (∪), Intersection (∩), and Constraint (/).

MedLan is the language of program expressions defined by these operations
as follows:

Pexp ::= Program | Pexp ∪ Pexp | Pexp ∩ Pexp | Pexp / Program

where Program is a named collection of clauses. Each set of clauses (program) 
is associated with a unique name by means of a global naming mechanism. 

In the sequel we will abuse the notation and use a program identifier to directly 
denote the set of clauses associated with it. 

More precisely, a program is a finite set of extended definite clauses of the
form

$A \leftarrow B_1, \ldots, B_n$

where each $B_i$ is either an atomic formula or a meta-level formula of the form
“C in Pexp”, where C is an atomic formula and Pexp a program expression.
A goal like “C in Pexp” introduces a form of message passing between
object-level programs. The idea is that the program containing the goal “C in Pexp”
sends the message C to the “virtual” program denoted by “Pexp”. As usual,
logical variables act as input/output channels between programs.

We assume that the language in which programs are written is fixed. Namely, 
there is a fixed set of function and predicate symbols that include all function 
and predicate symbols used in the programs being considered. Moreover, pro- 
gram names and program composition operations are disjoint from all other 
constant and function symbols that may occur in programs.

Here we give the operators an informal semantics by means of examples. In
section 2.1 we will show the formal (abstract) semantics.

Consider the following programs P and Res_Dept:

P:

can_access_folder(a_inform, X) ← employee(X) in Res_Dept
employee(john) ←
employee(ann) ←

Res_Dept:

employee(mary) ←

The first rule of P states that “a person can access a particular folder a_inform
if he/she works in the research department, Res_Dept”.

The query can_access_folder(a_inform, X) in the program P binds the
variable X to mary, since the evaluation of the goal employee(X) is performed
in the program Res_Dept.

Given a program expression E, we show a plain logic program that behaves 
as the program expression, i.e. it provides the same answers to the same 
queries, whatever is the operational semantics in use. We refer sometimes 
to this program as to the virtual program denoted by the expression. Such a 
transformational approach, is useful for an intuitive understanding of program 
expressions. 

Consider the following plain programs:

P:

can_access_folder(a_inform, X) ←
    employee(X) in (Res_Dept ∪ Direction_Dept),
    has_authorization(X) in Aut_Module
employee(john) ←
employee(ann) ←

Res_Dept:

employee(mary) ←
employee(fred) ←

Direction_Dept:

employee(john) ←
employee(ann) ←

Aut_Module:

has_authorization(mary) ←
has_authorization(john) ←

Here, P gives access to folder a_inform to people working either in the
Res_Dept or in the Direction_Dept, that is, “mary”, “fred”, “john” and “ann”.
Because of the further condition that the person must also be authorized, as
specified in the authorization module Aut_Module, the answer to the query
can_access_folder(a_inform, X) in the program P for the variable X will be:
mary and john.

The ∪ operator makes the program denoted by the program expression
Res_Dept ∪ Direction_Dept behave as a plain program containing the
clauses of Res_Dept and the clauses of Direction_Dept. As the example
shows, union may be used to factor knowledge into different modules.

Intersection allows knowledge to be combined by merging clauses with unifiable
heads into clauses having the conjunction of the bodies of the original clauses
as body. The net effect is that the two plain programs act as sets of constraints
one upon the other.

Consider the following example:

P:

can_access_folder(a_inform, X) ←
    employee(X) in (Res_Dept ∪ Direction_Dept),
    has_authorization(X) in (Aut_Module ∩ Validity_of_Aut)
employee(john) ←
employee(ann) ←

Res_Dept:

employee(mary) ←
employee(fred) ←

Direction_Dept:

employee(john) ←
employee(ann) ←

Aut_Module:

has_authorization(mary) ←
has_authorization(john) ←

Validity_of_Aut:

has_authorization(mary) ←
has_access(john, folder_B, section_J) ←
has_access(X, Y, Z) ← director_of(Z, X)

Now the answer to the query can_access_folder(a_inform, X) in the program
P will give for the variable X the binding mary, whose authorization is
still “valid”.

Notice that Aut_Module ∩ Validity_of_Aut does not say anything about
has_access, since nothing about has_access is deducible from Aut_Module.
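
For programs that consist only of ground unit clauses, union and intersection reduce to set union and set intersection of the facts, which is enough to replay the two queries above. The Python sketch below makes that simplifying assumption: atoms are encoded as tuples, the rule for has_access with variables is omitted, the third argument of the has_access fact is abbreviated to a placeholder, and can_access_folder is a hand-written stand-in for the first clause of P.

```python
# Each program is a set of ground facts encoded as tuples (predicate, args...).
RES_DEPT = {("employee", "mary"), ("employee", "fred")}
DIRECTION_DEPT = {("employee", "john"), ("employee", "ann")}
AUT_MODULE = {("has_authorization", "mary"), ("has_authorization", "john")}
VALIDITY_OF_AUT = {
    ("has_authorization", "mary"),
    ("has_access", "john", "folder_B", "section"),  # placeholder third argument
}

def union(p, q):
    """P ∪ Q on fact-only programs: a fact holds if either program states it."""
    return p | q

def intersection(p, q):
    """P ∩ Q on fact-only programs: a fact holds only if both programs state it."""
    return p & q

def can_access_folder(auth_program):
    """Answers for X in can_access_folder(a_inform, X), as a sorted list."""
    employees = {
        atom[1] for atom in union(RES_DEPT, DIRECTION_DEPT) if atom[0] == "employee"
    }
    authorized = {atom[1] for atom in auth_program if atom[0] == "has_authorization"}
    return sorted(employees & authorized)

with_aut_module = can_access_folder(AUT_MODULE)
with_validity = can_access_folder(intersection(AUT_MODULE, VALIDITY_OF_AUT))
```

The intersection contains no has_access fact, in line with the remark that nothing about has_access is deducible from Aut_Module.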

The constraint operator combines the features of union, intersection and a
simple form of negation to provide an asymmetric composition between a
program P and a program Q, where Q acts as a set of constraints for P, as
illustrated by the following example. Consider

Constraint_module:

has_authorization(X) ← was_authorized_by(Y, X), is_director(Y, Z),
    employee(X, Z) in Wrapper_Dept

Wrapper_Dept:

employee(X, res_dept) ← employee(X) in Res_Dept
employee(X, direct_dept) ← employee(X) in Direction_Dept

Validity_of_Aut:

has_authorization(mary) ←
has_access(john, folder_B, section_J) ←
has_access(X, Y, Z) ← director_of(Z, X)

The following plain program behaves as the program expression
Constraint_module / Validity_of_Aut:

has_authorization(X) ← X ≠ mary, was_authorized_by(Y, X),
    is_director(Y, Z), employee(X, Z) in Wrapper_Dept
has_authorization(mary) ← was_authorized_by(Y, mary), is_director(Y, Z),
    employee(mary, Z) in Wrapper_Dept

The following plain program behaves as the program expression
Validity_of_Aut / Constraint_module:

has_authorization(mary) ← was_authorized_by(Y, mary), is_director(Y, Z),
    employee(mary, Z) in Wrapper_Dept
has_access(john, folder_B, section_J) ←
has_access(X, Y, Z) ← director_of(Z, X)

Notice that the constraint is applied only to mary while the remaining
knowledge in the module is not affected.



2.1 Denotational semantics 

Now we give an abstract semantics. In (Renso 1998, Aquilino et al. 1995, 
Aquilino et al. 1997) also an operational top-down semantics is given and it 
is shown that the two semantics coincide. The semantics is limited to positive 
deductive data bases, and it is given in a bottom-up style by extending the 
standard immediate consequence operator (T(P)).

Recall that, for a definite logic program $P$, the immediate consequence
operator $T(P)$ is a continuous mapping over Herbrand interpretations defined as
follows (van Emden and Kowalski 1976). For any Herbrand interpretation $I$:

$A \in T(P)(I) \iff (\exists \bar{B} : A \leftarrow \bar{B} \in ground(P) \wedge \bar{B} \subseteq I)$

where $\bar{B}$ is a (possibly empty) conjunction of atoms and $ground(P)$ denotes
the ground (i.e. fully instantiated) version of program $P$.
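
For a finite ground program the operator can be written directly in Python. The sketch assumes a clause is a pair (head, body) with atoms as strings and the body as a tuple; the path program is a made-up illustration, not taken from the paper.

```python
def tp(program, interp):
    """Immediate consequences: heads of clauses whose whole body lies in interp."""
    return {head for head, body in program if set(body) <= interp}

def least_fixpoint(program):
    """Iterate tp from the empty interpretation until nothing changes."""
    interp = set()
    while True:
        nxt = tp(program, interp)
        if nxt == interp:
            return interp
        interp = nxt

# Hypothetical ground program: two edges and the ground path clauses they induce.
PROGRAM = [
    ("edge(a,b)", ()),
    ("edge(b,c)", ()),
    ("path(a,b)", ("edge(a,b)",)),
    ("path(b,c)", ("edge(b,c)",)),
    ("path(a,c)", ("edge(a,b)", "path(b,c)")),
]

first_step = tp(PROGRAM, set())
model = least_fixpoint(PROGRAM)
```

Starting from the empty interpretation, the first application yields exactly the unit clauses, and repeated application converges to the least model of the ground program.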

Such an approach is motivated by the observation that the classical least 
model semantics is not compositional. That is, it is not possible to obtain the 
least model of, say, the union of two programs P and Q by homomorphically 
composing the least models of P and Q. In (Brogi 1993) it is shown that
the $T_P$-based semantics is in fact both compositional and fully abstract with
respect to the repertoire of composition operations adopted in this paper. 

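Definition 1 The semantics of program expressions is given as follows:

$T(E \cup F)(I) = T(E)(I) \cup T(F)(I)$

$T(E \cap F)(I) = T(E)(I) \cap T(F)(I)$

$T(E / F)(I) = T(E \cap F)(I) \cup T(E \backslash\backslash F)(I)$

$T(E \backslash\backslash F)(I) = T(E)(I) \setminus \{A \mid A \leftarrow G \in Ground(F)(B_V)\}$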

The immediate consequences of a program expression E constrained by a 
program F is a combination of the union and intersection operator and a kind 
of complement of a program w.r.t. a program expression. Informally, consider 
the case in which E is a plain program constrained by a set of clauses F. The
resulting program is obtained by the union of two parts. One is the intersection 
of the two programs, that forces them to agree during the deduction. But 
intersection alone is not enough, because some clauses would be missing in 
the result. In particular, we miss all the clauses for predicates which are defined 
in E and not constrained by F. These predicates are of two kinds: the ones 
which do not have a definition in F, and those which have a definition in F 
that constrains only a subset of atoms potentially derivable in E. 
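
The composition equations can be tried out on ground programs with a few lines of Python. In this sketch, which is an assumption-laden illustration and not the authors' implementation, an expression is represented by its T function, a clause is a (head, body) pair, and the programs E and F are invented so that F constrains has_authorization(mary) but cannot derive it.

```python
def T(program):
    """Immediate consequence operator of a ground program, as a function of I."""
    return lambda interp: {h for h, body in program if set(body) <= interp}

def union(te, tf):
    return lambda interp: te(interp) | tf(interp)

def intersection(te, tf):
    return lambda interp: te(interp) & tf(interp)

def constraint(te, f_program):
    """T(E/F)(I) = T(E ∩ F)(I) ∪ (T(E)(I) minus the heads of ground F clauses)."""
    tf = T(f_program)
    f_heads = {h for h, _ in f_program}
    return lambda interp: (te(interp) & tf(interp)) | (te(interp) - f_heads)

def lfp(t):
    """Least fixpoint of a monotone operator, iterated from the empty set."""
    interp = set()
    while t(interp) != interp:
        interp = t(interp)
    return interp

# Hypothetical ground programs.
E = [
    ("was_authorized_by(bob,mary)", ()),
    ("was_authorized_by(bob,john)", ()),
    ("has_authorization(mary)", ("was_authorized_by(bob,mary)",)),
    ("has_authorization(john)", ("was_authorized_by(bob,john)",)),
]
F = [("has_authorization(mary)", ("valid(mary)",))]

unconstrained = lfp(T(E))
constrained = lfp(constraint(T(E), F))
```

In the constrained result has_authorization(john) survives because F has no clause for it, while has_authorization(mary) is dropped because F constrains it and cannot derive it.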



3 IMPLEMENTATION ISSUES 

Here we show how the T(P) semantics for our operators can be turned into an 
efficient bottom-up strategy by exploiting the technique of the so-called semi- 
naive computation strategy, that allows one to avoid redundant computations 
of recursive rules. 

The standard implementation of deductive databases is based on the bottom-up
evaluation of logic programming, i.e. on efficient implementations of the
computation of the fixpoint of the immediate consequence operator. In this
section we show how the simplest of these efficient implementations, i.e. the
semi-naive computation strategy, can be extended to handle the new features 
of MedLan. We consider here only positive programs. Not even the extension 
to stratified databases is straightforward, given that, for example, the union of 
two stratified databases is not necessarily stratified. The definition of classes of 
stratified programs that can be handled by a semi-naive computation strategy 
is the subject of our current research. 



3.1 The Seminaive Evaluation Technique

The seminaive evaluation technique (Abiteboul, Hull and Vianu 1995, Ullman
1988) is a straightforward extension of the immediate consequences opera- 
tor (also called naive evaluation), in that it avoids the recomputation of the 
same atoms, that might be triggered by recursive definitions, by focusing the 
computation only on the new atoms generated in the last step. 

We define the seminaive computation as an extension of the standard T(P) 
operator. Notice that we refer to the T(P) semantics described in section 2.1 
where we do not take into account the in feature. Extending the seminaive 
definition to include the message passing mechanism mirrors the approach 
presented in (Brogi, Renso and Turini 1997) and is a matter for future study.

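The new operator $\tau(P)$ has two arguments. The first one, $I$, represents the
current interpretation; the second one, $\Delta$, represents the set of facts computed
in the previous step.

Definition 2 Let $P$ be a program and $I$, $\Delta$ interpretations. Then
$\tau : (2^{B_V} \times 2^{B_V}) \to (2^{B_V} \times 2^{B_V})$ is defined as follows:

$\tau(P)(I, \Delta) = (I \cup \Delta, \; \{A \mid \exists B_1, \ldots, B_n : A \leftarrow B_1, \ldots, B_n \in ground(P) \; \wedge \; \exists i \in [1, \ldots, n] : B_i \in \Delta \; \wedge \; \{B_1, \ldots, B_n\} \subseteq I \cup \Delta\} - (I \cup \Delta))$

The powers of $\tau$ are defined as usual.

Definition 3 Powers of $\tau$

$\tau(P)(\emptyset, \emptyset) \uparrow 0 = (\emptyset, \emptyset)$

$\tau(P)(\emptyset, \emptyset) \uparrow i = \tau(P)(\tau(P)(\emptyset, \emptyset) \uparrow (i-1))$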
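
A Python sketch of the seminaive step on a finite ground program is given below. One point is an assumption of the sketch and not taken from the definitions: the iteration is seeded with the unit clauses as the initial $\Delta$, because a clause with an empty body has no body atom that could belong to $\Delta$. Clauses are (head, body) pairs and the four-clause program is invented for illustration.

```python
def tau(program, interp, delta):
    """One seminaive step on the pair (I, Δ); returns the next pair."""
    known = interp | delta
    new = {
        head
        for head, body in program
        # fire only clauses that use at least one atom from the last step
        if any(b in delta for b in body) and set(body) <= known
    }
    return known, new - known

def seminaive(program):
    """Iterate tau until Δ is empty; also return the sequence of Δ sets."""
    interp = set()
    delta = {head for head, body in program if not body}  # seeding assumption
    history = []
    while delta:
        history.append(set(delta))
        interp, delta = tau(program, interp, delta)
    return interp, history

def naive(program):
    """Naive fixpoint of the immediate consequence operator, for comparison."""
    interp = set()
    while True:
        nxt = {head for head, body in program if set(body) <= interp}
        if nxt == interp:
            return interp
        interp = nxt

# Hypothetical ground program forming a chain of derivations.
PROGRAM = [
    ("n0", ()),
    ("n1", ("n0",)),
    ("n2", ("n1",)),
    ("n3", ("n2", "n0")),
]

result, deltas = seminaive(PROGRAM)
```

Each $\Delta$ in the trace holds only the atoms that are new at that step, which is the recomputation the seminaive strategy avoids, and the final interpretation coincides with the naive fixpoint.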
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3.2 The Seminaive for Composition Operators 

The extension of the r operator for the composition operators is quite intu- 
itive. The bottom-up step of the seminaive function applied to P U Q is the 
set-theoretic union of the respective arguments, and an analogous definition 
works for intersection. 

t{PuQ) =r(P)(/,A)Ur(g)(/,A) 
r(Png) =r(P)(/,A)nr(g)(/,A) 



where 

{A, B) U (A', B') = {AUA',BU B') 

{A, B) n {A', P') = (A n yi', P n P') 

In the definition of the constraint operator we will use the Ground predicate 
with two arguments: the first one is a given program, and the second is the 
Herbrand base with respect to which we instantiate the given program. 



$$
\tau(P / Q)(I, \Delta) = \tau(P \cap Q)(I, \Delta) \cup \tau(P \backslash\backslash Q)(I, \Delta)
$$



The definition of τ for the constraint operator has the same structure as the 
naive definition based on the immediate consequence operator. Since τ works 
on pairs of interpretations, we need to use the operator ∪ that performs the 
union of pairs. 



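$$
\tau(P \backslash\backslash Q)(I, \Delta) =
\bigl(\tau(P)(I, \Delta)_1 \setminus \{A \mid A \leftarrow B_1, \dots, B_n \in Ground(Q)(B_V)\},\
\tau(P)(I, \Delta)_2 \setminus \{A \mid A \leftarrow B_1, \dots, B_n \in Ground(Q)(B_V)\}\bigr)
$$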



The auxiliary definition of τ(P \\ Q) is such that, given a pair (I, Δ), it 
computes a new pair containing only those atoms derived by P that are not 
defined by any clause of Q. The correctness of the seminaive definition has 
been proved in (Renso 1998), as stated by the following theorem. 

Theorem 1 The seminaive definition for the composition operators is correct 
w.r.t. the T(P) semantics of the operators. Let E be a program expression; then 

$$
(\tau(E)\uparrow\omega)_1 = T(E)\uparrow\omega
$$
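
The step for the composition operators can be sketched in the same style. This is a minimal illustration under stated assumptions: programs are already ground (so the heads of Ground(Q) are simply the heads of Q's clauses), atoms are strings, and all function names (`tau`, `pair_union`, `pair_intersection`, `tau_union`, `tau_intersection`, `tau_minus`, `tau_constraint`) are hypothetical names chosen for this sketch.

```python
def tau(program, interp, delta):
    """Seminaive step for a plain ground program of (head, body) pairs."""
    known = interp | delta
    derived = {
        head
        for head, body in program
        if any(b in delta for b in body) and all(b in known for b in body)
    }
    return known, derived - known


def pair_union(left, right):
    """Componentwise union of two pairs of interpretations."""
    return left[0] | right[0], left[1] | right[1]


def pair_intersection(left, right):
    """Componentwise intersection of two pairs of interpretations."""
    return left[0] & right[0], left[1] & right[1]


def tau_union(p, q, interp, delta):
    return pair_union(tau(p, interp, delta), tau(q, interp, delta))


def tau_intersection(p, q, interp, delta):
    return pair_intersection(tau(p, interp, delta), tau(q, interp, delta))


def tau_minus(p, q, interp, delta):
    """Auxiliary step: keep the atoms of P's step that no clause of Q defines."""
    defined_in_q = {head for head, _body in q}
    first, second = tau(p, interp, delta)
    return first - defined_in_q, second - defined_in_q


def tau_constraint(p, q, interp, delta):
    """Constraint step: atoms agreed on by P and Q, plus atoms only P defines."""
    return pair_union(
        tau_intersection(p, q, interp, delta),
        tau_minus(p, q, interp, delta),
    )
```

With P = {p ← a, q ← a} and Q = {p ← b}, a step from Δ = {a} yields only q as a new atom: p is defined in Q but Q cannot derive it, so the constraint blocks it. Once b is also available, both p and q are produced.
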

4 AN EXAMPLE FOR INTERNAL SECURITY 

The general architecture we have in mind for a system that integrates 
database sources according to a given set of security rules is depicted in 
Fig. 1. 




[Fig. 1 - A general architecture. Figure not reproduced; the only surviving 
labels are "User identification" and "Login-password".] 



The architecture can be considered as consisting of three levels: 

• the first level hides those parts of the database sources that we do not 
want to be visible outside; wrappers are defined for each database and they 
export only those parts of the databases that we allow to be visible; each 
integrated database is assumed to be provided with a secure transmission 
system and a firewall. We only expect to interface with the underlying 
database systems (e.g. through any Java interface), which only provide us 
with a number of relations that can be queried according to the internal 
policies of the integrated database. 

• the second security level concerns the mediator layer. Here mediator 
modules use data from wrappers to reason on it and to perform semantic 
integration; mediators can also use data provided by other wrappers and 
mediators. At this level, more policies can be implemented. They can depend 
on the application being implemented or they can be stated by agreements 
with the particular integrated database. That is, some exceptions to the 
internal policy of a database can be allowed by the integrated database as 
long as some rules in the integrated environment are satisfied. As an 
example, we can imagine that some pieces of information are made available 
to the integrated environment manager and to the other database managers 
(in the integrated environment) but must be hidden from some external users. 

• the third level of security is realized by a user identification system. It could 
be a login-password system or some other more sophisticated system, like 
a voice recognizer. No particular identification system is assumed in our 
models. 



Thus our model concentrates on the integration layer to provide for security 
of the application and of the integrated systems, i.e. for the “internal” core 
of the integrated system. 

The general architecture in Fig. 1 could be applied to the following situa- 
tion of integrating different databases of employees with permission assigned 
for each level, as it is handled by mandatory access control models. 

DB1 

    employee(john, rossi, 001, dept1, 100) 
    employee(susan, white, 302, dept2, 108) 

DB2 

    employee(frank, green, 527, 70) 
    employee(mary, brown, 670, 65) 

W_DB1 

    employee(CodRef, Dept, Salary) ← 
        employee(FName, LName, CodRef, Dept, Salary) in DB1 

W_DB2 

    employee(CodRef, nil, Salary) ← 
        employee(FName, LName, CodRef, Salary) in DB2 

Mediator1 

    employee(User, CodRef, Dept, Salary) ← 
        user(User), employee(CodRef, Dept, Salary) in (W_DB1 ∪ W_DB2) 

Mediator2 

    employee(User, CodRef, Dept, Salary) ← 
        employee(User, CodRef, Dept, Salary) in (Mediator1 / Security_Rules) 

Security_Rules 

    employee(User, CodRef, Dept, Salary) ← has_permission(User) in Level_C 

Level_U 

    has_permission(user1) 
    has_permission(user2) 

Level_C 

    has_permission(user3) 
    has_permission(user4) 

Level_T 

    has_permission(user5) 
    has_permission(user6) 

Level_TS 

    has_permission(user7) 
    has_permission(user8) 

Please note that DB1 and DB2 report on a set of employees that are 
represented, in the two databases, by a relation that has the same name but a 
different number of parameters. The information in the two databases is 
merged into the employee relation of the integrated environment by means of 
the W_DB1 and W_DB2 wrappers, respectively. 

Mediator1 then defines a new employee relation that includes a user name 
(e.g. the name of the user who queries the integrated system); it is defined 
by the set of employees “deducible” in the union of the two wrappers. 

Mediator2 defines the same employee relation, given by the relation in 
Mediator1 “constrained” by the rules in the Security_Rules module. The rule 
in Security_Rules serves, in this example, to limit access to the information 
on the employees of the whole integrated system: only users that have a 
Level_C permission (user3 and user4) have access to this information. 
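
The effect of the example can be mimicked with ordinary Python sets. The sketch below is not MedLan and makes several assumptions: the `user` relation of Mediator1 is not given in the example, so it is taken here to contain user1 to user8; `nil` is rendered as `None`; and the constraint operator is modeled as a filter, which is adequate only because the single Security_Rules clause defines every employee tuple, so a tuple survives exactly when both Mediator1 and Security_Rules derive it. All function names are hypothetical.

```python
# Source relations copied from the example; codes are strings to keep leading zeros.
DB1 = {("john", "rossi", "001", "dept1", 100), ("susan", "white", "302", "dept2", 108)}
DB2 = {("frank", "green", "527", 70), ("mary", "brown", "670", 65)}

LEVEL_C = {"user3", "user4"}
USERS = {f"user{i}" for i in range(1, 9)}  # assumption: the user relation


def w_db1():
    """Wrapper for DB1: export only code, department and salary."""
    return {(cod, dept, salary) for _fname, _lname, cod, dept, salary in DB1}


def w_db2():
    """Wrapper for DB2: no department is stored, so export None in its place."""
    return {(cod, None, salary) for _fname, _lname, cod, salary in DB2}


def mediator1():
    """Employee tuples of the union of the wrappers, paired with every user."""
    return {(user, *emp) for user in USERS for emp in w_db1() | w_db2()}


def mediator2():
    """Mediator1 constrained by Security_Rules: keep Level_C users only."""
    return {t for t in mediator1() if t[0] in LEVEL_C}


def view_for(user):
    """The 'allowed' view of one user on the integrated employee relation."""
    return {t[1:] for t in mediator2() if t[0] == user}
```

Here `view_for("user3")` contains the four employee tuples, while `view_for("user1")` is empty: the user is not checked against each item but simply receives a view that contains nothing.
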



5 CONCLUSIONS 

Our research on deductive database systems and, in particular, on the problem 
of integrating different databases led us to the design of a language that 
supports the notion of mediation via a suite of operators for combining 
collections of clauses and the “in” feature. 

The most interesting of the MedLan operators is the constraint (/) operator, 
which allows for different applications (Asirelli et al. 1998) that we have 
studied and developed in the past and that are currently under further 
investigation. 

We paid special attention to a formal definition of the semantics of the 
language. The benefit of this effort is that the operational and denotational 
semantics give us a way to implement both top-down and bottom-up evaluation 
of a query through the mediator, which also gives MedLan the capability to 
deal with the “virtual” and “materialized” view approaches (Hull 1997). 

One of the goals of this paper was the presentation of the seminaive 
implementation of the language. This bottom-up implementation technique, 
which allows an efficient processing of universal queries, has been extended 
to queries involving MedLan program expressions. 

We are currently studying different aspects of MedLan in order to extend it 
to cope with different integration mechanisms, such as cooperation and 
federation. Future work will concentrate on experimenting with MedLan in 
various application domains. We find that the use of the constraint operator 
to deal with security issues is at the same time a demanding application 
field for MedLan and a promising solution. 

In fact, we believe that our approach is novel in many respects compared 
with other approaches. In particular, one of its interesting aspects is the 
idea of using the notion of “view” to implement access control. In this way, 
instead of checking the user's rights to access some information, we give 
each user his or her own “allowed” view on the database, in a very 
straightforward way. 

Future work on the application we have presented will concern the study 
of different security models and enforcing policies existing in the literature to 
evaluate the feasibility of more complex examples and the effectiveness of our 
approach in real systems. 

1 INTRODUCTION 



This article presents a general outline of the effects of the use of information 
technology in companies. The picture that emerges will show when and why IT- 
auditing became necessary as a profession and the reasons why the necessity is 
destined to become even greater in the future. 

First, the changing position of information technology within the economy is 
sketched in broad terms; after that, the position of information systems 
within the corporate economy of the future is shown. Some salient points 
concerning security, continuity and reliability in this future picture are then 
raised. Finally, in the summary, certain myths concerning the control 
mechanisms required for electronic processes are disposed of, and the 
challenges facing companies and IT-auditors are described. 

The subjects discussed in this article all concern the consequences of using 
computers for organizations in general; particular typologies are not elaborated. 

When a company has been automated to the point where its activities are recorded 
with no significant human intervention, then a situation may be said to have arisen 
in which the computer acts as a partner in the execution of corporate activities. 
Organizations that make use of all the forms of computer support described below 
for their company's processes may be classed as automated to the extent that the 
computer has become a partner in the execution of corporate activities: 

• Orders for services and/or products are received by the organization by 
electronic means; 

• Warehousing is entirely managed by computer controlled robots; 

• Manufacturing processes are computer controlled; 

• All financial transactions with banks, suppliers and customers are completely 
automated; 

• Management tasks are fully supported by information supplied nearly 
exclusively by means of electronic processes. 

Such full ‘partnership’ has not yet been achieved. But it is already an important 
point of management policy at present and will be even more so in future, because 
there are a number of extremely important advantages inherent in this situation. 
Processes are carried out more cheaply and reliably in this way than when they are 
performed by people. Further on in this article we will examine the reasoning 
underlying this statement. People will be carrying out activities that are more 
interesting than the routine tasks associated with executing basic processes day by 
day, and they will attach greater importance to the (needs of the) company’s clients. 

There are a number of fundamental reasons, quite apart from specific company 
typology, why this ‘partnership’ with automation leads to an organizational 
structure that is quite different from that arising from a situation in which the tasks 
concerned are carried out by human agency. For instance, computers are not 
capable of self-seeking behavior. Processes that have been automated may be 
regarded as ‘systematic’. This means that, given identical input, and if the 
conditions of execution are always constant and the same basic files are used, these 
systems should, barring unforeseen circumstances, yield identical results. In the 
case of systematically automated processes it is thus important to ensure that 
organizational conditions are such that processes will do what they are supposed to 
do and that their execution will take place undisturbed. 

This ‘partnership’ between man and the computer is a natural development in the 
use of computers in our society. It is a development that can also be shown from 
the perspective of internal audit and security. For this reason, we have summarized 
the effects of automation on control structures and security measures within the 
organization in the past, before turning our attention to the specific features of the 
consequences of using the computer as a partner in business management. 



2 ROLE OF KNOWLEDGE MANAGEMENT AND INFORMATION 
TECHNOLOGIES IN THE CORPORATE ORGANIZATION 

If the past may be characterized by use of the word ‘mechanization’ and the 
present by ‘automation’, the watchword for the future is ‘knowledge management’. 
The term ‘knowledge management’, in particular, is probably used here in a 
slightly different sense from usual, while our use of the term ‘automation’ may be 
doing less than justice to the current position. However, the usual meaning 
attached to these terms covers about 70% of the interpretation given in this article. 
Suggestions for better terminology are most welcome. A time scale for the three 
phases is given in Figure 1. 

Thus, three phases are distinguished, namely: 

• Mechanization: the computer takes over certain tasks from people without 
affecting the structure of the organization or the way people work; the 
computer clearly plays a supporting role; 

• Automation: the way people work is reorganized in such a way that important 
tasks and even entire jobs are carried out by computer; 

• Knowledge management: the jobs are carried out or even directed by 
computers, people support the computerized tasks and the computer is a 
partner in operational management. 



In the rest of this chapter we will examine the most important characteristics of 
internal control and security in these three phases in the use of computers within 
companies. 




Figure 1: Time scale for phases in use of IT: mechanization (up to 1970), 
automation (up to 1995-2000), knowledge management (from 2000). 



2.1 Mechanization 

The use of computers in industry in support of human tasks was necessary to 
achieve the requisite degree of efficiency and accuracy in operational management. 
The connection between computer support and human corporate activity was that 
at set intervals people had to prepare the results of their labors for processing by a 
computer. At other set times, they had to be in a position to take delivery of output 
from a computer, to assess it and to make use of it in carrying out their tasks. 
Processing was usually by batch processes and if any inaccuracies emerged, it was 
possible simply to repeat the process. If there was a certain critical time element, it 
was usually a matter of hours rather than minutes. Although it was generally 
acknowledged that automated execution was more accurate (more reliable) and 
faster (more efficient) than was feasible in an exclusively human organization, 
people did not trust the computer to be sufficiently systematic or of sufficient 
integrity to render checking the results of the process unnecessary. 

During this generation of computer use, all activities (i.e. all output) were checked 
for completeness and accuracy before the results of the process affected 
operational management. 

Controlling the processing usually included integrity controls on the input and 
output of the computer center, while controls on completeness and on accuracy 
were not always carried out by the same organizational units. Those responsible 
for input checked its accuracy, and the computer center checked the completeness 
of the input data before processing began. The computer center also checked the 
accuracy and completeness of the processing before the output was dispatched to 
the user. Employees in the business department concerned checked the accuracy of 
the results of the process. The department to some extent also checked the 
completeness of the processing, preferably through someone other than the 
person who had checked the output. The systems were developed by professional 
developers who were responsible for adequate testing and implementation of the 
programs on a production computer (or for transferring these to the change 
management department and computer operators). 

There were no great inherent risks attached to this way of doing things because the 
user checked all input reports and the computer's output for accuracy. All output, 
moreover, was controlled for completeness. If any omissions were picked up, these 
were the fault of the computer center for not checking the completeness of the 
input data sufficiently thoroughly before the production processes started. The 
interdependence of the various production processes was managed by checking 
that all input data was present, before processing started. 

So long as computer centers were using tapes and punch cards, the operation 
formed an easily manageable whole. The introduction of disc units made it 
necessary to create guarantees within the organization of the computer center to 
make it possible to ascertain that the data on the disc units remained unchanged 
during the intervals between the various production processes. 

The ledger and networks of check sums maintained by personnel gave the 
company early insight into the completeness and accuracy of the management of 
the company as a whole. 

2.2 Automation 

With automation, the first steps were taken towards further integration of the 
organization of tasks and the use of computerized processes. This led to the 
integration of the computer in the workplace, initially by using ‘dumb’ terminals 
and later personal computers. Human beings initiated transactions that were 
processed partly on their own personal computers and partly on a (more) centrally 
managed computer system. Operational management was organized in such a way 
that use of the computer was maximized. The clear distinction between input, 
processing and output, which all had their own specific check phases, disappeared. 
The promptness of automated support was perceived as the degree to which the 
response times of such support varied within agreed limits. These limits were 
defined as a particular percentage of the response time and were (and are) 
expressed in terms of a few seconds. 

This new way of processing data posed new problems for controlling and general 
manageability. The continuity (availability and response times) of automated 
support requires measures specifically designed for this new situation. These 
measures are carried out largely within the technical infrastructure and in the 
procedures of the computer center and the arrangements made with it, and not in 
the information systems and the organization of the users. In order to achieve the 
requisite accuracy in its processing results, the organization inevitably becomes 
dependent on the checking procedures within the system. This leads to a stricter 
control on process integrity, rather than a control on data integrity. Such controls 
on process integrity are carried out by means of extensive tests and acceptance 
procedures when new or modified systems are installed. The accuracy of the data 
is then assured by the way transactions are processed, that is, by the integrity of 
the automated processes. The authorization to introduce transactions must be 
preventatively controlled in order to prevent inaccurate processing. As a 
consequence, the integrity of the technical infrastructure must also be 
guaranteed. The method of processing often leads to individuals having authority 
delegated to them so that they can use the computer to carry out independent 
processing: this means that the tasks carried out are not checked by a second 
member of staff. To perform checks, the applications are equipped with all possible 
relational controls, and individual authorizations are determined and entered into 
the automated systems so that the automated systems can check them. The 
completeness of the processing cannot be determined simply. Networks of check 
sums that are separate from the automated system often do not operate promptly 
enough; furthermore, these networks are often ineffective due to the processing 
volume. In order to be able to rely on automated networks of check sums, reliable 
automated operation of several processes that are independently organized and 
used is required. In other words, the implementation of automated systems must 
take place in reliable environments. 

The organization controls the automated system and no longer directly controls the 
individual processing of specific input and the resulting specific output. The 
checks which, in combination, have to determine the integrity of processing are of 
a very preventative nature. The professional developer must supply systems that 
are explicitly tested and accepted by a user: the user must test that all the checks 
needed are present. Production is done separately from the development of 
systems, so that no unauthorized changes can take place in the production systems. 
The organization monitors the integrity of the change processes with respect to 
changes in the production environment. There is a form of access control that 
limits the access of all those involved in the organization (including the computer 
center personnel) to those functions for which they are authorized. There is also an 
explicit control on the access to and use of data by employees. The delegation of 
responsibility to use transactions and the limited functionality of the transactions 
are both designed to lead to a constant presence of inevitable conflicts of interest 
among the staff concerned. These conflicts of interest can also lead to retrospective 
opportunities to check the accuracy of the processing (networks of check sums). 
Where necessary, a risk analysis can result in transactions only being processed 
after explicit approval is given by an employee with the authority to do so. A 
distinction is made between transactions that are critical for the organization and 
other transactions that are not considered critical. The critical transactions are 
provided with meticulous preventative controlling measures within and through the 
automated systems, to prevent misuse and/or inaccurate results. The computer 
center uses a variety of automated processes that monitor the integrity of the 
process sequence. The completeness of the processing is checked using specially 
designed processes, which operate alongside transaction processing, or are run 
after the business processing has been completed. 
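
One simple form such a completeness process could take is a reconciliation of control totals, run after the business processing. The sketch below is an illustration under assumptions, not a description of any particular system: transactions are (identifier, amount) pairs with amounts given as decimal strings, and the names `control_totals`, `is_complete` and `missing_transactions` are hypothetical.

```python
from decimal import Decimal


def control_totals(transactions):
    """Record count and amount total for a list of (id, amount) pairs."""
    amounts = [Decimal(amount) for _tid, amount in transactions]
    return len(amounts), sum(amounts, Decimal("0"))


def is_complete(submitted, posted):
    """Processing counts as complete when both control totals agree."""
    return control_totals(submitted) == control_totals(posted)


def missing_transactions(submitted, posted):
    """Identifiers that were submitted but never posted, for follow-up."""
    posted_ids = {tid for tid, _amount in posted}
    return [tid for tid, _amount in submitted if tid not in posted_ids]
```

A mismatch in the totals only signals that something is missing or altered; the list of missing identifiers is what would be reported to staff for resolution.
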

The organization can test whether the results of automated processing are complete 
and accurate. However, such tests require a strict control of the system of the 
automated processes and of the management of these processes. In order to 
determine that the organization has the requisite technical infrastructure available 
and is managing its information systems properly, an auditor must have explicit 
knowledge of automation techniques and of the relationship between automated 
processes. The professional IT-auditor has become necessary. 

2.3 Knowledge management 

Automated processes are becoming more and more intelligent. People are no 
longer able to verify the processing results. People accept the results of the 
processing as complete and accurate, without additional control measures. The 
company employee no longer has to ‘feed’ the computer with transactions and 
check whether the processing is complete and accurate. The human being within 
the company is free to perform more effective tasks. He or she is then concerned 
with activities such as providing services to others, selling services (often 
computer controlled services), purchasing raw materials, selling products, using 
information supplied by the computer, assessing irregularities and solving 
‘problems’ indicated by the computer. 

The data to be processed are entered ‘automatically’ outside the company, in 
warehouses or by production machinery. It is not possible to verify the accuracy of 
this input, because it is fed directly to the automated systems. In particular, the 
responsibility for its accuracy and completeness lies with an external third party, or 
the input is obviously accurate precisely because it is generated as an integral part 
of the manufacturing process. The output resulting from the processing leaves the 
company again, or controls production machinery without human intervention. 

The differences from the previous phase as described in paragraph 2.2, and also the 
possible problems, are mainly associated with the required continuity of 
computerized support and the even more urgent need for preventative controls on 
the running of the processes. Systems must be continuously available to ensure 
smooth operations. It is not possible for employees within the organization to 
check the accuracy and completeness of the individual data supplied. The 
computer checks the relations between processes and systems and reports 
irregularities to the employees within the company. Completeness of the 
processing is ensured by the fact that everything which is supplied is also (one way 
or another) processed. 

The system controls its own processing completeness and sends messages to the 
human agent to resolve discrepancies, so that the system can guarantee the 
required continuous completeness. The whole organization of the process is set up 
in such a way that access authorization and the required recording of 
communication between systems and between the system and the human agent is 
completely regulated. The problems with respect to the retrospective checks on the 
operating processes are the same as those described in paragraph 2.2. 

All control procedures are of a preventative nature, with the exception of those 
measures that serve to determine that the preventative checking procedures have 
worked. All repressive control procedures are, however, also computerized. Man 
has adaptive controls at his disposal (facilities to direct the control procedures). 
These ‘human interventions’ are mainly activated using the management 
information made available to management and the analyses performed by man of 
possible systematic causes of requests generated by the computerized system to 
solve problems indicated. External factors can also induce management to 
introduce changes in processing parameters or in the process itself. 

The resources available today make it possible to organize knowledge management 
so as to guarantee reliable and complete management information. The design of 
the computerized system is then such that the organization can rely on the 
computerized processes on which these demands are made to run with integrity on 
a continuous basis. The processes can be regarded as reliable because they have 
integrity (adequate test and acceptance procedures, all changes in systems are 
tested and approved). The input is, by definition, correct (the inputting bodies take 
the responsibility for this). The control tables etc. as well as all data have integrity 
by definition (plausibility checks, limited human intervention). The operational 
output and the data stored in computerized form have integrity (because the 
processing has). The (possibly necessary) processes for extracting data for 
management information have integrity (the same conditions apply to this as all 
other processes), as does the control information for the refinement process (the 
same grounds apply as for the other control tables). The management can request 
information from the computerized systems without negatively influencing the 
reliability of the systems, and the management is familiar with the procedures 
designed for this. In this way, management is provided with reliable information. 

The (alarming) result of this kind of knowledge management is that middle 
management has become superfluous. Top managers can identify their own 
information needs and obtain the desired information simply and without 
endangering the reliability of the basic processes. 

When the importance of the technical support of the business processes is not fully 
appreciated, there will be organizations that believe that ‘knowledge management’ 
has no special effect on the way auditing needs to be carried out. They will think 
that they can continue to audit responsibly as has been described in section 2.2, 
without technical IT-auditing. However, when knowledge management takes the 
form described in this section, companies will be forced to recognize that 
independent and impartial experts must be asked to judge the adequacy of the 
technical organization of the computer systems. This means an explicit assessment 
of the technical infrastructures and the organizations that design, implement and 
perform maintenance on these technical systems. Completeness of the control 
procedures in the technical environment and in the information systems, as well as 
the links between the two, is of vital importance to organizations. 

The measures needed to be able to use computers in this way are described in the 
following section. 



3 CONDITIONS FOR KNOWLEDGE MANAGEMENT 

Although from the point of view of 'knowledge management' the computer is 
clearly seen as a partner in the organization of the company and not merely as 
support to human operation, the computer must also play a supporting role in the 
tasks which still have to be done by man. For this reason, before dealing with the 
facilities needed in IT organization, some conditions will be discussed which refer 
to the delegation of responsibility to the various company employees. The 
subsequently discussed facilities in the IT systems organization will be handled on 
three levels: 

• Technical infrastructure; 

• Structuring information systems; 

• User organization. 

Naturally, the totality of measures on the three levels must be consistent and 
efficient. This means that the segregation of duties required within the organization 
must be completely safeguarded in the structure of the information systems and in 
the technical infrastructure. When the same technical structure or the same 
information system supports several deliberately separated parts of the 
organization, it must meet particular requirements. This also applies to the IT 
organization described in section 2.2. 

The architecture of the cooperating computerized processes must make it possible 
for the computer to carry out the required detective controls independently. The 
computer is not self-seeking, and therefore the structure of computerized 
administrative procedures should not, in theory, need any modularity or interfaces. 
Later in this section we will show that the requirements of controllability mean that 
it is necessary to divide up processes in such a way that the computer can 
guarantee uniform processing. 

3.1 Authorization 

Authorization means the delegation of responsibility. As such, it is the technique 
that is used to effect the desired segregation of duties within the organization. 
Authorization is, consequently, a function that only occurs in situations where 
people are working together. Computers cannot delegate, even where there is 
partnership between man and computer. Maybe a subsequent evolutionary step will 
lead to the computer's powers being so superior to human ability that computers 
will be able to manage people better than we ourselves can. In this situation, 
the computer could possibly perform the authorization function. 

The authorization specifications laid down by an organization are entered into the 
computer, which checks that people (can) only get access to the resources for 
which they are authorized. This is termed ‘access control’. Naturally, the computer 
must be able to determine the identity of the employee concerned (authenticating 
the identity entered, if necessary using 'identity proofs'). 
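
A minimal sketch of such an access control check is given below. Everything in it is an assumption made for illustration: the user names, the resource names, the use of a password as the 'identity proof', and the fixed demonstration salt (a real system would use a random salt per user and a protected credential store).

```python
import hashlib
import hmac

SALT = b"demo-salt"  # assumption: fixed salt, for illustration only


def _digest(password):
    """Derive a password hash with PBKDF2 (standard library)."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), SALT, 10_000)


# Hypothetical identity proofs and authorization specifications.
CREDENTIALS = {"alice": _digest("wonderland"), "bob": _digest("builder")}
AUTHORIZATIONS = {
    "alice": {"ledger:read"},
    "bob": {"ledger:read", "ledger:post"},
}


def authenticate(user, password):
    """Check the identity entered against the stored identity proof."""
    expected = CREDENTIALS.get(user)
    return expected is not None and hmac.compare_digest(expected, _digest(password))


def access_allowed(user, password, resource):
    """Grant access only to authenticated users, and only to authorized resources."""
    return authenticate(user, password) and resource in AUTHORIZATIONS.get(user, set())
```

The two steps are kept separate on purpose: authentication establishes who is asking, and the authorization table, which the organization maintains, decides what that person may reach.
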

Segregation of duties is applied first and foremost to limit the risks associated with 
delegating responsibilities to employees. This means that duties must be segregated 
both vertically and horizontally. 

Vertical segregation of duties limits the authority of staff members in relation to the 
structure and the design of computerized processing. The employee's 
responsibilities are determined in such a way that it is not possible for him or her to 
manipulate the links between computerized processes or to influence them in any 
other way so as to damage the interests of the organization. Introducing adequate 
vertical segregation of duties should result in the creation of access specifications, 
which can be entered into the computer so that access to computer programs and 
files can be restricted and monitored. 
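
As an illustration only, access specifications of this kind could be held as a table of permitted (resource, action) pairs per employee, with every decision logged so that access is both restricted and monitored. The class below and the names `clerk_01` and `payments_file` are hypothetical; the sketch does not describe any particular access control product.

```python
from dataclasses import dataclass, field


@dataclass
class AccessSpecification:
    """Hypothetical access specification derived from the segregation of duties."""
    # employee id -> set of (resource, action) pairs the employee may use
    grants: dict = field(default_factory=dict)
    # every decision is recorded so that access can be monitored afterwards
    log: list = field(default_factory=list)

    def grant(self, employee, resource, action):
        self.grants.setdefault(employee, set()).add((resource, action))

    def check(self, employee, resource, action):
        allowed = (resource, action) in self.grants.get(employee, set())
        self.log.append((employee, resource, action, allowed))
        return allowed


spec = AccessSpecification()
spec.grant("clerk_01", "payments_file", "read")
print(spec.check("clerk_01", "payments_file", "read"))   # True
print(spec.check("clerk_01", "payments_file", "write"))  # False
```

Anything that is not explicitly granted is refused, which matches the idea that the employee cannot influence processes outside his or her area of responsibility.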

Horizontal segregation of duties limits the financial powers of an employee's area 
of responsibility. This segregation of duties and the procedural consequences for 
the computerized systems should also be fully computerized. This will not be 
regulated by access control software only, but also by controls and procedures, 
which are built into the computerized administrative processes. 

When defining the segregation of duties, the organization must take into account 
the fact that the computer has become a partner in the organization of the 
performance of duties, and that man no longer has the opportunity to intervene at 
an appropriate time in computerized processes. The input supplied for processing 
will be checked by the system, but no longer on the initiative of man or by man. 
Man will, of course, take action when the system refuses input and refers it back to 
the human organization for correction. However, these referrals will be highly 
exceptional. 

3.2 Technical infrastructure 

It must be clear that not every use of automation will have to meet strict control 
and security requirements. The efficiency and effectiveness of the work carried out 
by the various employees of the company will, to a great extent, be determined by 
the technical infrastructure of the computerized systems. In order to prevent all 
computerized systems from being subject to one single regime of control, IT 
environments are distinguished which are sufficient for the different levels of 
control and security. The way in which the levels are technically separated depends 
on the differences in the required control and security typologies of the various 
environments. The differences arise from an analysis of the actual risks to the 
organization of the various classes of business information and processes. The first 
requirement for this level of IT support is therefore the management of the IT 
infrastructure in accordance with a classification into separate environments which 
are derived from a risk analysis and a well-considered decision process for general 
control and security procedures for each separate environment. 

The actual access control therefore does not only require detailed consideration of 
the consequences of the employees’ authorities in terms of access to data and 
processes, but also of the design of the links between separate environments so that 
no unacceptable risks arise of undesired communication between these 
environments. The role of the current access control software will be limited to the 
initial procedure for identifying the user and access monitoring for an IT 
environment. Within an environment, if necessary, the database management 
system will perform the most important monitoring of the access to data and 
processes (object monitoring). 

From a technical point of view, every environment will be regarded as an ‘open 
shop environment’ or as a ‘closed shop environment’. Most environments should be 
classified as ‘open shop’ with varying degrees of controls on, firstly, the manner in 
which access is obtained, and, secondly, on the processes available in the specific 
environment. In open shop environments control will still be possible to varying 
degrees. These controls will check: 

• Integrity of the logging of computerized processes; 

• Destination of the output; 

• Security of the typology of the linked systems; 

• Identification of users; 

• Integrity of processes and data; 

• Continuity of processes and data; 

• Opportunities for making changes in the processes and the technical 
infrastructure. 

In a closed shop environment there will be guarantees concerning: 

• Integrity of the logging of computerized processes; 

• Integrity of the input and the processes; 

• Integrity of the technical infrastructure; 

• Scopes of responsibilities of explicitly identifiable users; 

• Desired continuity and availability of data and processes; 

• Integrity of all changes in processes and in the technical infrastructure. 

The destination of the output in a closed shop environment is not a specific point 
for attention because the technical infrastructure, the processes and the data 
guarantee the correct destination. 

The closed shop is the only environment suitable for applications in which the 
control structure is so extensive that even the repressive measures are 
computerized. The remainder of this article deals exclusively with the closed shop 
environment. 




Figure 2: Strategy for distinguishing environments for information systems with common security 
requirements 



3.3 Structuring information systems 



As stated earlier, consistency between the segregation of duties in the organization 
and the architecture of the supporting information systems is extremely important. 
This segregation across different IT environments can mean that the information 
system is designed in such a way that different parts of the total process run in 
different open and closed shop IT environments according to their required 
security levels for processing. 

An automatic check is carried out into the completeness and accuracy of the results 
of the processes per source of activities, before going on to those procedures which 
lead to the critical processes executed in the closed shop environment and which 
lead in turn to deliveries and payments. The precondition for automating detective 
and repressive control measures is that the structure of the automated processes 
must lend itself to the establishment of controlled links between IT environments. 
For this reason, the modularity and the interfaces between modules must be 
designed so that the detective and repressive controls can be based on these. To 
achieve this, each type of source (and therefore also the link between the 
computers) must in itself have its own recognizable position in the architecture. 
Names of specific sources and destinations will be unique within the internal 
organization as a whole (banks, suppliers, purchasers etc.). Communication links 
which are needed to meet obligations laid down by the authorities (e.g. VAT 
return), possible external supporting bodies and external information sources will 
be included in the architecture in such a way that the closed shop characteristics of 
particular environments are never affected by these links. 

There are processes in the critical closed shop environment, which generate 
forecasts relating to important parameters in the context of the movement of 
money and goods. These computer generated forecasts are used, at predefined 
moments in the processing, to carry out a completeness check before determining 
that conclusions can be drawn on the basis of the processes concerned. 
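
A minimal sketch of such a forecast-based completeness check is given below. The function name, the relative tolerance of 5% and the example figures are assumptions made for illustration; the article does not specify how closely the processed totals must match the forecast.

```python
def completeness_check(forecast, actual, tolerance=0.05):
    """Return True when the actual total lies within a relative tolerance of the forecast.

    `tolerance` is a hypothetical parameter: 0.05 means a deviation of at most 5%.
    """
    if forecast == 0:
        # Nothing was expected, so only an empty result counts as complete.
        return actual == 0
    return abs(actual - forecast) <= tolerance * abs(forecast)


# Forecast of 1000 payments; 1030 processed is accepted, 1100 is referred back.
print(completeness_check(1000, 1030))  # True
print(completeness_check(1000, 1100))  # False
```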

There are limits to the scope of all processes, depending on the classification of the 
process and the control specific consequences associated with the environment in 
which the process is carried out. The processes in a closed shop environment, by 
definition, have control and security procedures of such a quality that it is possible 
to check the completeness and the accuracy of the processes and their results in an 
open shop environment. The designed interfaces with the open shop environment 
are technically sufficient for this. The open shop environment cannot negatively 
influence the closed shop environment. 

Although the information systems' control structures must be consistent with the 
company's policy, the architecture of the information system is not dependent on 
the organizational structure within which the users of the information systems 
happen to be working. The architecture of the computerized information systems is 
designed to guarantee the internal integrity of the processes. Access control 
technologies monitor the segregation of duties in the user organization and the 
access to the computer by external bodies. 

3.4 User organization 

The users of IT systems are: 

• Computer center staff; 

• Professional system developers; 

• System programmers; 

• System engineers; 

• Business staff; 

• Other staff; 

• Manufacturing processes; 

• Computers not belonging to the legal entity. 

This list provides a starting point for deciding on the most important environments 
to be differentiated. The remainder of this section confines itself to the issue of the 
relationship between this list of users and the closed shop environment and the 
controllability of company operations. 

The computer center staff will not have direct access to the closed shop 
environment. Should the closed shop environment experience technical problems 
that require human intervention, or should the resources available to the 
environment need to be upgraded or downgraded, then computer center staff will have 
momentary authorization to interact with the closed shop environment. Examples 
of such situations include process or data structure reorganizations, the introduction 
of new hardware, and changes to system parameters to allow safe preventative hardware 
maintenance. 
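
One way to picture momentary authorization is as a grant that names one employee, one reason and a fixed time window, after which access lapses without further action. The sketch below assumes this interpretation; the class, the identifier `operator_07` and the two-hour window are hypothetical.

```python
from datetime import datetime, timedelta


class MomentaryAuthorization:
    """Hypothetical time-limited grant for an intervention in the closed shop."""

    def __init__(self, employee, reason, granted_at, duration):
        self.employee = employee
        self.reason = reason              # e.g. "introduction of new hardware"
        self.granted_at = granted_at
        self.expires_at = granted_at + duration

    def permits(self, employee, now):
        # Valid only for the named employee and only inside the time window.
        return employee == self.employee and self.granted_at <= now < self.expires_at


start = datetime(2000, 1, 1, 9, 0)
grant = MomentaryAuthorization("operator_07", "introduction of new hardware",
                               start, timedelta(hours=2))
print(grant.permits("operator_07", start + timedelta(hours=1)))  # True
print(grant.permits("operator_07", start + timedelta(hours=3)))  # False
```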

In other words, the system and process monitoring are completely automated. The 
conditions for 'real time' back up will be created on an ongoing basis by the system 
itself, and, in an emergency, the system will automatically switch over to the 
backup system. 

Professional system developers do not have access to closed shop environments. 
If troubleshooting is needed, all the necessary data and copies of processes are 
dumped on systems where the professional developer can make the necessary 
changes on the basis of analyses performed. The professional developer may be 
authorized to consult the closed shop environment during the troubleshooting. 
However, changes to the systems will only be entered in the closed shop 
environment after a second developer has explicitly checked the changes, the 
changes have been approved by the owner of the system and by the computer 
center. Hopefully there will be procedures that provide forecasts of the 
relationship between the cause, or the reason for the change, and the effect of the 
change so that an analysis can be made as to whether only required code changes 
have been introduced. 

Technical system managers (system programmers and/or system engineers) 
never have access to closed shop environments. System software is introduced into 
the production environment via a route guaranteeing optimum stability and 
reliability before the software becomes operational in the closed shop environment. 

In many cases, business staff is authorized to consult the environment to call up 
predefined standard output (i.e. no ad hoc 'query facilities'). There may be 
corporate staff members who are responsible for certain adaptive controls of the 
system and who can therefore alter certain very critical tables or data sets. It goes 
without saying that such a responsibility must be implemented in such a way that 
employees can be identified precisely enough for them to be subsequently called to 
account for their actions. It is also possible to design a procedure for interaction so 
that two employees are responsible for the accuracy of the input. 
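
The two-employee procedure mentioned above can be sketched as follows. This is an assumed design, not one prescribed by the article: an entry only becomes effective after a second, different employee has confirmed it, and both identities stay attached to the entry so that each can be called to account. All names are hypothetical.

```python
class DualControlEntry:
    """Hypothetical input to a critical table that requires two employees."""

    def __init__(self, value, entered_by):
        self.value = value
        self.entered_by = entered_by
        self.confirmed_by = None

    def confirm(self, employee):
        # The confirming employee must differ from the one who entered the value.
        if employee == self.entered_by:
            raise PermissionError("a second, different employee must confirm the input")
        self.confirmed_by = employee

    @property
    def effective(self):
        return self.confirmed_by is not None


entry = DualControlEntry(value=4.25, entered_by="staff_a")
print(entry.effective)  # False
entry.confirm("staff_b")
print(entry.effective)  # True
```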

Other staff will not have access to the closed shop environment but they will have 
access to other open shop environments. The closed shop environment will take 
the data needed for the closed shop processes from the open shop environments in 
a safe and controlled way. Employees and other computers (internal or external to 
the organization) requesting free information will not have any access to the closed 
shop environment. 



Machines for manufacturing processes will, where necessary, have their own 
closed shop environments which are separate from the administrative closed shop 
environment because the automated systems for production will set higher 
continuity and response time requirements than are necessary for the 
administrative environment. 

The computers not belonging to the legal entity (for example: clients, 
consumers, government and banks) will not have direct links with the closed shop 
environment. 

This article may create the impression that the concept of human control has 
disappeared entirely from the internal organization. This is not the case. There are 
circumstances in which the organization of the business activities will make a ‘back 
office’ necessary. This situation arises whenever there are functions in the internal 
organization for which no effective segregation of duties can be created and the 
materiality of an action cannot be limited with automated tools. The organization 
will then have to delay processing the results of the performance of these functions 
and actions by maintaining a second function in the company which checks the 
actions of those who performed the first function in a timely manner. This check 
is geared to the extent to which the functions are performed within the policy 
guidelines of the company. Those performing the check are authorized to take 
action in good time to limit the consequences of misappropriation. An example of 
the situation alluded to here is the purchase and sale of cash, where such purchase 
and sale occurs in the first instance through agreements between people. These 
agreements must then be recorded in the system preferably by means of an action 
performed by another person, independently of the agreement. If the computer can 
derive the agreements from a conversation, it is conceivable that the computer 
might be able to intervene to prevent unauthorized agreements being made. This 
will be conceivable when voice recognition becomes sufficiently sophisticated. 

Employees will have to be available in the human organization who can be 
directed by the computer to solve technical problems with which the computer 
cannot cope. These people contribute to the accuracy of the production process but 
they will not have direct access to the closed shop environment. They cannot 
adversely influence the integrity of the closed shop environment, other than by 
inadequate performance of their adjustment functions and tasks. 

Management will have timely information about the company and the position of 
the company in relation to the outside world so that it can assess the desired 
performance of corporate operations in its entirety. Detailed reporting concerning 
the way in which responsibilities are carried out by the closed shop environment is 
required. These reports will be from the perspective of the processes and from the 
perspective of the corporate operations (current and forecast expenditure and 
income). They will offer management the possibility of intervening in good time 
by applying adaptive control measures to the system. 



4 SUMMARY AND CONCLUSIONS 

A number of myths concerning a sound structure for the organization of business 
tasks within a company must be dealt with. 

Human input does not need to be verified for the accuracy of the output to be 
considered reliable. If the company makes a third party contractually responsible 
for the input and has sufficient resources in-house to be able to demonstrate that 
the data cannot be distorted within the company, such verification is not necessary. 
Check sums are not needed to ascertain the integrity of automated processes in 
good time. Check sums are usually ineffective in a real time environment. The 
preventative control measures must be sufficient. 

Technical systems management, and the functions for developing systems and for 
troubleshooting, may not be given access to the closed shop environment. Their 
support must be absolutely unnecessary. If this position is taken, the required 
stability of the closed shop environment is achievable. The separation of 
responsibility for the computer center from the other tasks is a precondition to 
guarantee the necessary segregation of duties between development (and/or 
modifications in accepted systems) and the use of these systems. Comparing the 
individual outputs from systems with records on paper is an ineffective and 
extremely inefficient way of obtaining the required confidence with regard to the 
accuracy of the processing when the company works exclusively with automated 
processes. 

It is becoming clear that companies have to pay much more attention than before 
to achieving professional automated support of the business. An important 
requirement for this is the use of a classification of processes, data and the internal 
organization according to their risks to the company in terms of the privacy of 
data, integrity and the overall continuity of the company. 

Measures to manage the technical infrastructure of the automated systems must 
take into account efficient and effective control procedures for the various classes 
of systems and environments. The management of the architecture of the 
information systems must be designed in such a way that the company can 
guarantee the presence of the required controls on the integrity of information 
processing and on the links between the information systems. The computer center 
functions must be automated as fully as possible so that the operation of the 
technical infrastructure is controllable and human intervention is only required in 
very exceptional cases. 

The audit of a highly automated environment so that the company, the auditor and 
society can rely on these systems has become a requirement. An auditing body 
specializing in the field of auditing, automation and IT which is impartial and 
independent of the decisions regarding the design of the automated systems (both 
technical and application systems) will be used. Such an auditor can issue 
statements to society on the adequacy of compliance with measures required by 
law, such as the privacy legislation and the computer crime act. Social dependency 
on systems managed by certain computer centers will lead to the need to issue 
statements concerning the adequacy of the organizational design of these 
automated systems and of the computer centers. 

This article deals with the growing influence of automation on the current 
organization of companies. In particular the article is concerned with the 
implications of this growing influence on the control structures of a company. It 
considers 'IT' as a 'partner' in the business operations and not merely a supporting 
tool for activities to be performed by human beings. It was stated that the 
automation has to be designed very tightly and in accordance with rigorous 
procedures so that the 'partnership' can be achieved properly. The growing 
influence of automation has also had an effect on the development of IT auditing. 
The knowledge and know-how required for advising a company about the best 
way to shape this partnership calls for auditors with an affinity for and experience 
with IT and the management of change. The adequacy of the design for use of IT 
must be assessed in relation to the overall internal organization and corporate 
objectives. 

It is unfortunate that a growing number of technologies are being developed which 
do not meet the requirements for rapid and effective use in the closed shop 
environment described in this article. The lack of attention to the development of 
interfacing techniques and protocols for safe communication between IT 
environments with different security levels also slows down the achievement of a 
more complete use of IT within modern businesses. 



Abstract 

Security and control measures can be implemented as user controls, application 
controls or general controls. To safeguard the integrity of a database, general 
controls may be relied on. Or is it necessary, for instance for highly critical 
applications, to take additional measures in the form of application controls in the 
application itself? Based on a risk analysis, in which risks were identified and 
measures taken to cover these risks, the Dutch Central Bank decided that additional 
application controls in the form of control totals were necessary. To realize these 
control totals a separate Control subsystem was designed and implemented. This 
document describes the control subsystem which is implemented in the payment 
system of the Bank. 

This paper first gives an overview of the payment system of the Dutch central 
bank. The second section describes how quality assurance on security and control 
is dealt with during application development. In sections 3 to 9 the 
control subsystem is specified in detail. 

Keywords 

Risk analysis, quality assurance, standards on security and control, application 
controls, control totals, financial transactions. 

1. OVERVIEW OF TOP 

1.1 Flow of payments in the Netherlands 

In November 1997 the payment system of the Dutch central bank, called TOP, 
became operational. The system is an automated real-time gross-settlement system. 
A major function of the payment system is to facilitate funds transfers to the 
participants in the system. As part of this function the Bank processes payments, 
totalling tens of billions of guilders each day, using an automated large-value 
settlement system. 

Different kinds of payments are processed through TOP. The following table shows 
the average number of transactions per day and the average amount per transaction 
(in Dutch guilders). 

Table 1 Average number of transactions per day and average amount per 
transaction 

                                  Average number    Average amount   Average part   Average part
                                  of transactions   per day (in      in total       in total
                                  per day           billion NLG)     number in %    amount in %

Retail payments                   7.000.000          6,9             99,55           5,3
Urgent retail payments               18.000          3,0              0,26           2,3
Settlement of trades of the           2.800         38,2              0,04           2,8
Amsterdam Exchange (wholesale)
Interbank payments                    1.100         38,2              0,02          29,5
Non-resident payments                 9.500         77,9              0,13          60,1

Retail payments are processed by Interpay, a clearing house founded by the 
commercial banks. All commercial banks, as well as large business clients, have a 
data communication connection with Interpay. Clients of commercial banks may 
submit their transactions to these banks by means of special forms or by data 
communication. All transactions of Automated Teller Machines are also submitted 
to Interpay. 

Interpay stores and forwards the transactions received to its clients (commercial 
banks and large business clients). The bulk of this information is also presented on 
tape or by cartridge. Every day, Interpay calculates the outstanding balances of the 
commercial banks. Once a day, these outstanding balances are settled in TOP. 

In addition to normal retail payments, the systems and infrastructure described are 
also used for urgent retail payments. These kinds of payments have a maximum of 
10 million Dutch guilders. These orders are advised to the ultimate beneficiary 
within an hour of being submitted, based on the guaranteed settlement in TOP by 
the end of the day. 

Figure 1 Overview of the payment infrastructure in the Netherlands 

[Figure not reproduced; recoverable labels: Dutch Central Bank, Interpay.] 


As the overview of the payment infrastructure shows, FA payments are sent directly 
from the participant to the Dutch Central Bank. The non-resident payments are 
routed via the network infrastructure of Interpay. Interpay routes the non-resident 
payment orders to the Dutch Central Bank. The retail and urgent retail payments 
are processed by Interpay where net settlement takes place. The outcome is sent to 
the central bank to be cleared. 
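
Net settlement can be illustrated with a short sketch: all payments between banks are reduced to one net position per bank, and only these positions are passed on for settlement. The function name, the bank names and the amounts (whole guilders) are hypothetical; the actual Interpay procedure is not described in this paper.

```python
from collections import defaultdict


def net_positions(transactions):
    """Reduce (payer_bank, payee_bank, amount) records to one net balance per bank.

    A negative balance means the bank owes money at settlement.
    """
    balances = defaultdict(int)
    for payer, payee, amount in transactions:
        balances[payer] -= amount
        balances[payee] += amount
    return dict(balances)


day = [
    ("Bank A", "Bank B", 500),
    ("Bank B", "Bank C", 200),
    ("Bank C", "Bank A", 100),
]
print(net_positions(day))  # {'Bank A': -400, 'Bank B': 300, 'Bank C': 100}
```

The net positions always sum to zero, which gives a simple control on the completeness of the calculation.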

1.2 Participants in TOP 

There are about 300 account holders in the Dutch Central Bank’s payment system. 
These include: 

commercial banks 

Ministry of Finance 

Agency of the Ministry of Finance 

interbank clearing house (Interpay) 

other central banks 

international development banks 

The latter two account holders do not have an on-line connection with TOP. Their 
payment orders are sent to the Dutch Central Bank, where the input is processed by 
the Payments Department. 

1.3 Input and output 

Orders can be sent to TOP via the TOP end station (TES, which is a dedicated PC), 
SWIFT for cross-border payments (TARGET), hostlink connection, 
tapes/cartridges and paper. Information from TOP sent to the participants is 
received by the TOP end station, hostlink connection, SWIFT and paper. 

1.4 TOP subsystems 

TOP comprises eight different subsystems. This paragraph briefly describes the 
subsystems with the most relevant functions. 

Pre-processing subsystem 

In this subsystem the messages from all participants are received. These can be 
individual messages, tapes and batches. After having carried out controls like 
authentication, reliability, format and layout, etc., the messages are processed 
further. The tapes and batches are unpacked and transformed into individual 
messages. 

Transaction-processing subsystem 

This subsystem contains among other things the functions for the settlement of the 
payment orders and the queuing mechanism. Also, controls like access control to 
the account which has to be debited and the check of sufficient cover are carried 
out. 
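
The interplay between the cover check and the queuing mechanism can be sketched as below. This is a simplified assumption rather than the TOP design: an order is settled at once when the debited account has sufficient cover, otherwise it waits in a queue that is re-examined after every settlement. Overdraft limits, priorities and access control to the account are left out, and all names are hypothetical.

```python
from collections import deque


class SettlementEngine:
    """Hypothetical gross settlement with a cover check and a waiting queue."""

    def __init__(self, balances):
        self.balances = dict(balances)
        self.queue = deque()

    def submit(self, debtor, creditor, amount):
        if self.balances[debtor] >= amount:
            self._settle(debtor, creditor, amount)
            self._release_queue()
            return "settled"
        # Insufficient cover: the order waits until funds arrive.
        self.queue.append((debtor, creditor, amount))
        return "queued"

    def _settle(self, debtor, creditor, amount):
        self.balances[debtor] -= amount
        self.balances[creditor] += amount

    def _release_queue(self):
        # Keep passing over the queue until no waiting order can be settled.
        progress = True
        while progress:
            progress = False
            for order in list(self.queue):
                debtor, creditor, amount = order
                if self.balances[debtor] >= amount:
                    self.queue.remove(order)
                    self._settle(debtor, creditor, amount)
                    progress = True


engine = SettlementEngine({"A": 100, "B": 0})
print(engine.submit("B", "A", 50))  # queued
print(engine.submit("A", "B", 80))  # settled
print(engine.balances)              # {'A': 70, 'B': 30}
```

In the example the payment from A provides B with cover, so the queued order from B is released in the same step.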

Post-processing subsystem 

In this subsystem all the output is made ready for sending to the participants. This 
can be done by individual messages, but also batches and tapes can be formed. 

Control subsystem 

The Control subsystem has links with the three subsystems listed above. 
Information about the processing is written to the Control subsystem. With this 
information the Control subsystem executes reconciliation controls during the 
working day. In this paper this subsystem will be described in greater detail. 
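
The general idea of a reconciliation control based on control totals can be sketched as follows. The sketch assumes that each subsystem reports a (count, total amount) pair per status and that every received order must be settled, queued or rejected; these status names and the function are hypothetical and are not taken from the TOP specification.

```python
def reconcile(received, settled, queued, rejected):
    """Compare the control total of received orders with the totals per outcome.

    Each argument is a (count, total_amount) pair reported by a subsystem.
    """
    expected_count = settled[0] + queued[0] + rejected[0]
    expected_amount = settled[1] + queued[1] + rejected[1]
    return {
        "count_ok": received[0] == expected_count,
        "amount_ok": received[1] == expected_amount,
    }


# 10 orders worth 5000 received: 7 settled, 2 queued, 1 rejected.
print(reconcile((10, 5000), (7, 3500), (2, 1000), (1, 500)))
# {'count_ok': True, 'amount_ok': True}
```

A difference in either total signals that an order has been lost, duplicated or altered between subsystems.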

Figure 2 shows the relation between the Pre-processing, Transaction-processing 
and Post-processing subsystems and the Control subsystem. 

Figure 2 Relation between the Pre-processing, Transaction-processing and 
Post-processing subsystems and the Control subsystem 

[Figure not reproduced; recoverable labels: input, Pre-processing, order (AEH), Transaction processing, result (URE), Post-processing, output, Control.] 

*) Information about the processing, e.g. a change of status of the transactions, is 
written to the Control subsystem. 

To provide a complete overview of TOP, the other subsystems are described very 
briefly: 

Maintenance of permanent data subsystem: this subsystem contains the functions to 
input, change or delete permanent data. 

Periodical processing subsystem: this subsystem contains the functions for 
housekeeping purposes at the end of a working day, week, month or year. 

Interest and costs subsystem: this subsystem periodically calculates interest and 
costs, changes these into orders and offers them to the Transaction subsystem for 
further processing. 

Inquiry subsystem: this subsystem processes the inquiries made by participants. 



1.5 Environment 

The TOP application is operational with DB2 as its database management system. 
General controls such as access control (RACF) and problem and change 
management procedures are in place. 

2. QUALITY ASSURANCE ON SECURITY AND CONTROL 

2.1 General 

The Dutch Central Bank is well aware of the necessity of quality assurance on 
security and control during the development of applications. To ensure that the 
right level of measures is included in the application, a project organization is 
implemented in the Bank and standards on security and control are issued. The 
standards describe the steps that must be taken to make sure that the risks which 
can be defined for the application are identified and covered by the appropriate 
measures. The standards have to be followed by the employees who develop the 
application, and are also used by the system auditors of the Internal Audit 
Department while performing the audits. 

2.2 Standards on security and control 

The standards regarding security and control can be clarified as follows. 

Structured approach 

The main theme is the risk analysis. During the development of the application, 
risks are identified and measures are taken to cover these risks. When the 
development of the application makes progress, the risks are identified in greater 
detail. The risk analysis is subdivided into a number of different steps, which are 
described below. 

System characteristics 

A brief description is made of the application, highlighting the most important 
characteristics. Attention must be given to the quality criteria of 
reliability 
controllability 
confidentiality 
continuity 
authorization 

The description has to point out where the risks could be. Answers must be found 
to questions such as: Is it a financial system? Are the data which are processed 
confidential? Based on this description a first, global classification is made of 
the critical character of the system. The classification can be high, medium or low. 
The standards on security and control list criteria which help to 
classify the application. 

The result of the classification of the application is compared with the platform 
classification. All the platforms used within the Dutch Central Bank are classified. 
These platforms are PCs, local area networks, mid-range computers and 
mainframes. For each platform the standard level of security and control is 
specified. This is done for each of the above-mentioned quality criteria, classified 
as high, medium or low. In this way it is easy to see whether the standard security 
measures of the chosen platform meet the security requirements of the application. 
If this is not the case for one or more of the quality criteria, additional measures 
are necessary or another platform has to be chosen. 
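
The comparison between the application classification and the platform classification can be sketched as a lookup per quality criterion. The five criteria and the levels high, medium and low come from the text above; the function name and the example classifications are hypothetical.

```python
LEVELS = {"low": 1, "medium": 2, "high": 3}
CRITERIA = ("reliability", "controllability", "confidentiality",
            "continuity", "authorization")


def shortfalls(application, platform):
    """Return the criteria for which the platform standard is below the requirement."""
    return [c for c in CRITERIA
            if LEVELS[application[c]] > LEVELS[platform[c]]]


# Hypothetical classifications for a critical financial application.
application = {"reliability": "high", "controllability": "high",
               "confidentiality": "medium", "continuity": "high",
               "authorization": "high"}
platform = {"reliability": "high", "controllability": "medium",
            "confidentiality": "medium", "continuity": "high",
            "authorization": "high"}

print(shortfalls(application, platform))  # ['controllability']
```

A non-empty result means that additional measures are needed for the listed criteria or that another platform has to be chosen.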

Risk inventory 

In this step the risks are identified in much greater detail. The risk analysis is 
conducted during the stages of development ‘definition study’, ‘global design’ and 
‘detailed design’. During the development of the system, the risk analysis is 
developed from its global nature (definition study) into a detailed product (detailed 
design). Table 2 presents the relationship between the steps of the risk analysis and 
the relevant phase of system development. 

Table 2 Relationship between the risk analysis and the phase of system development 

                               Information   Definition   Global    Detailed   Test
                               plan          study        design    design     phase

System characteristics         G             D
Risk inventory                               G            D
Requirements of security
and control
  - whole system                             G            D
  - subsystems                                            G         D
Measures of security and                                  G         D
control (subsystems included)
Testing (subsystems included)                                                  D


After the risk analysis is completed, the measures for covering the risks are 
specified. The results of this exercise are recorded in a document called the security 
chapter. This chapter is part of the detailed design. 

The security chapter for TOP is specified in detail for the different subsystems and 
the quality criteria. Table 3 shows the matrix in which the relationship is given 
between the subsystems and the quality criteria. Table 4 presents a part of the 
detailed risk inventory for the Transaction-processing subsystem. The risk that is 
specified in detail is correctness and integrity. This is Scheme 2 from Table 3. 




Table 3 Relationship between subsystems and quality criteria 
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Table 4 Part of the detailed risk inventory for the Transaction processing subsystem 
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Testing 

During the testing phase it is determined whether the measures of security and
control are implemented in the application. Tests are performed by different
employees of the Bank. Employees of the Automation Department System
Development perform tests to determine whether the product they developed is in
accordance with the specifications. The end users test whether the
application operates in accordance with the specifications they have approved.

3. CONTROL SUBSYSTEM 

3.1 Objective of the Control subsystem 

The objective of the Control subsystem is to guarantee the reliability of TOP's
transaction processing. Transaction processing includes all functions of the TOP
application with respect to:

• the receipt and registration of incoming messages in the TOP database;
• the processing of financial transactions, especially as regards account
  balances;
• the sending of output messages which are related to the above-mentioned
  functions.

3.2 System characteristics

The Control subsystem starts automatically at the beginning of every working day;
it is one of the subsystems started first. The supervisor of TOP can change the
parameter setting that determines the length of the interval between control runs.

In normal situations the Control subsystem executes controls every half hour.
During the day the Control subsystem can be switched off by making the length of
the interval as long as the working day. However, the last control run will always be
executed; it cannot be switched off. The first and the last control run are extensive
in the sense that all controls are executed. The half-hourly control runs during the
day execute only a subset of the controls.

The possibility of changing the interval between the control runs has been
implemented in the system because in the event of errors the user wants to have
the time to solve the problem without the Control subsystem generating more
warning messages. Changes in the parameter settings are audited by the internal
control section of the Payments Department.
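
The scheduling rule for control runs can be illustrated with a small Python sketch. It is an assumption-laden illustration rather than the TOP implementation: the function name `plan_control_runs`, the labels "full" and "subset", and the working hours in the example are all hypothetical.

```python
from datetime import datetime, timedelta

def plan_control_runs(day_start, day_end, interval):
    """Return (time, kind) pairs for one working day.

    The first and the last run execute all controls ("full"); runs in
    between execute a subset. The last run is always planned, however
    long the interval is.
    """
    if interval <= timedelta(0):
        raise ValueError("interval must be positive")
    runs = [(day_start, "full")]
    moment = day_start + interval
    while moment < day_end:
        runs.append((moment, "subset"))
        moment += interval
    runs.append((day_end, "full"))
    return runs

# Hypothetical working day from 08:00 to 18:00.
start = datetime(2000, 1, 3, 8, 0)
end = datetime(2000, 1, 3, 18, 0)

normal = plan_control_runs(start, end, timedelta(minutes=30))
# "Switching off" by making the interval as long as the working day.
switched_off = plan_control_runs(start, end, end - start)
```

With the interval set to the length of the working day, only the first and the last (full) runs remain, which matches the way the text describes switching the subsystem off during the day.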

Every time that the Control subsystem executes the controls, a (limited) report is
generated. This report consists of a few standard forms which can be read very
easily (visual control at a single glance). When errors are detected the supervisor is
informed, not only by the above-mentioned reports but also by a warning message
which will be displayed on his terminal. The supervisor has to acknowledge the
warning by using the Enter key. The warning messages are ranked from messages
related to errors which do not have a great impact to messages related to errors
which can have a great impact on the data in the database. The errors are specified
in the reports.

The supervisor has to take adequate action on the basis of the warning messages. 
Errors detected by the Control subsystem never cause an automatic stop of TOP. 
This action has to be performed by the supervisor himself. 



3.3 Main parts of subsystem Control 

The Control subsystem consists of two main parts: 

A. Control totals 

B. Control log 

A. Control totals 

Based on the message-oriented character of the TOP application as a whole, where
- via different media - incoming messages act as triggers for the transaction
processes, the following description can be made:

A. incoming messages lead to registered delivery units (AEH); 

B. delivery units lead to orders (OPD), in some cases through provisional orders 
(VOP) when the input has taken place via the TOP end station (TES); 

C. orders (OPD) lead to financial transfers (OVB), future financial transfers 
(GOB) or actions on financial transactions or account data (e.g. an order to 
withdraw a future transfer); 

D. on reaching the value date, future transfers lead to orders (OPD); 

E. transfers, when inserted in a queue for settlement or processed in the cover of
the account and/or settlement, lead to changes in the account processing data;

F. depending on specific circumstances the above-mentioned transaction processes 
lead to output messages at different moments. 

These relations have been translated into the control totals, which are described as
follows:

Control totals (I) - AEH versus OPD

1. completeness of registered delivery units (AEH); when an AEH is registered
the Control subsystem starts the control totals;

2. reconciliation between the approved and rejected AEH and the number of 
registered AEHs; 

3. reconciliation between the number of orders (OPD) and AEHs; 

4. reconciliation between the number of OPDs which are selected to different 
phases of processing and the total number of registered OPDs; 

5. reconciliation between the number of approved provisional orders (VOP) and 
the number of OPDs which are input through the TOP end station. 
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This part of the control totals is described in Section 4. 

Control totals (II) - OPD versus OVB/GOB 

6. reconciliation between number of orders (OPD) and number of financial 
transfers (OVB); 

7. reconciliation between number of OPDs and number of future financial 
transfers (GOB); 

8. reconciliation between number of GOBs which are transformed at value date 
and number of OVBs which are the result of this action; 

9. reconciliation between number of withdrawals and cancelled OPDs and the 
number of cancelled OVBs and GOBs. 

These control totals are described in Section 5. 

Control totals (III) - account versus order 

10. reconciliation between the amounts of the financial transfers (OVB) and the
changes in account data (total control);

11. reconciliation between the amounts of the OVBs and the changes in account
data (controls in detail, per individual account);

12. consistency controls of individual accounts. 

These control totals are described in Section 6. 

Control totals (IV) - number of output messages versus number of 
processed OPDs 

13. reconciliation between number of output messages and number of processed 
OPDs, OVBs and GOBs; control per type of output message 

These control totals are described in Section 7. 

Control totals (V) - number of GOBs today versus previous day 

14. reconciliation between the number of GOBs which are present today and the 
number of GOBs present at the end of the previous working day. 

These control totals are described in Section 8. 

Control totals (VI) - completeness of all output messages 

15. completeness control of the processing of all output messages. 

These control totals are described in Section 9. 

The relations between the process flow and the controls mentioned above are the 
following: 
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A. incoming messages lead to registered delivery units (AEH). 

Controls 1 to 5 

B. delivery units lead to orders (OPD), in some cases through provisional orders 
(VOP) when the input has taken place via the TOP end station (TES). 

Controls 6 to 9 

C. orders (OPD) lead to financial transactions - transfers (OVB) and future 
financial transfers (GOB) - or actions on financial transactions or account data 
(e.g. an order to withdraw a future transfer). 

Controls 10 to 12 

D. on reaching the value date, future transfers lead to orders (OPD). 

Control 9 

E. transfers when inserted in a queue for settlement or processed in the cover of 
the account and/or settlement, lead to changes in the account processing data. 

Controls 10 to 12 

F. depending on specific circumstances the transaction processes lead to output 
messages at different moments. 

Control 15 

B. Control log 

The goal of this control is to determine that, at the moment the control run is 
executed, the actual (financial) balances have been completed in a reliable way. All 
of the primary processes that take place within the Pre-processing, Transaction- 
processing and Post-processing subsystems write the actions they execute to a 
specific control log in the form of records. With every control run of the Control 
subsystem the control log is counted from the beginning and the totals are 
reconciled with the end balances of the TOP database at that moment. 



4. CONTROL TOTALS (I) - AEH VERSUS OPD 

4.1 Goal 

In this part the control totals concerning the AEH and the OPD are specified in
detail. The goal of this part of the control totals is to check that:

• today's registered AEHs are consistently classified into approved and rejected
  AEHs. Whether an AEH is approved or rejected can be deduced from
  specific characteristics (attributes) which are checked by the Control
  subsystem;
• the number of registered OPDs is in accordance with the number of
  approved AEHs;
• the OPDs, subdivided into various states of processing, are in accordance with
  the number of AEHs;
• viewed by medium of origin, the number of OPDs registered via the TOP end
  station reconciles with the number of approved VOPs.

The above-mentioned control totals are reproduced in the diagram presented in
Table 5. This report is an example of one of the control reports generated by the
Control subsystem and is a part of the actual report.

In the first column the transaction types are listed. In the columns A to E the state
of processing is specified. In column F the totals of the columns A to E are
presented. Table 5 shows a few of the possible transaction types. If a combination
of row and column is not logical, a "null" is presented.

4.2 Description of the control totals 

Incoming messages which can contain financial orders are registered as Delivery 
units (AEH). The registration takes place in a new record in the DB2 table AEH. 
AEHs can be in the form of a message which contains one order or in the form of a 
batch or tape which can contain one or more orders. After the receipt of the 
message, batch or tape one of the controls that is executed is that TOP checks if the 
number of orders is in accordance with the information which is included in the 
batch or tape (header or trailer). The number of orders is registered in the record of 
the entity AEH (total of orders advised). The approval or rejection of the AEH is 
registered in the record concerning the AEH. The reconciliation between the 
number of registered AEHs and the number of approved or rejected AEHs is 
presented in the second part of Table 5. 

Next, the orders are individually checked on several points. In this way the
individual orders can be approved or rejected by TOP. The number of approvals or
rejections is also registered in the record of the AEH concerned. Part 1 of Table 5
presents the reconciliation between the total of advised orders which should be in
the AEH and the number of approved or rejected orders. Also, a distinction is made
between all the AEHs and the ones that are older than one hour. This is done so
that, when a difference occurs between these numbers, a signal is given to the
supervisor that there is something wrong.

After the AEHs have been completely checked the approved orders are processed
further. The AEHs are transformed into orders (OPD) which can be processed by
the Transaction processing subsystem. The transformation is registered in the TOP
database; a new record is written in the DB2 table OPD. The orders which are
registered as OPD but are not yet processed further are presented in column A of
Part 3 of Table 5. After the Transaction processing subsystem has executed different
controls, an order can be approved or rejected. The number of rejected OPDs is
presented in column B. The approved OPDs are processed further and the OPDs
which are financial orders are thus transformed into financial transfers (OVB). The
financial transfers which have a value date that is in the future are transformed into
future financial transfers (GOB). The number of financial transfers (OVB) is
presented in column C, the number of future financial transfers in column D. The
number of orders which are not financial transfers, such as pointing a transaction to
the front of the queue for settlement, is presented in column E.

The total number of orders in the AEHs presented in Part 1 must be in accordance 
with the total number of today’s registered OPDs, which are presented in column A 
in Part 3. 
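
The arithmetic behind these reconciliations can be sketched as follows. This is a minimal illustration, assuming that a difference of zero means "reconciled" and that cells shown as "null" are simply absent; the function names and the example figures are hypothetical and do not come from TOP.

```python
def aeh_difference(advised, approved, rejected):
    """Advised orders minus (approved + rejected) orders; 0 means reconciled."""
    return advised - (approved + rejected)

def row_total_matches(row):
    """Check that column F of a row equals the sum of its columns A to E.

    `row` maps column letters to counts; "null" cells are left out.
    """
    return row["F"] == sum(row.get(column, 0) for column in "ABCDE")

# Hypothetical figures for one control run.
print(aeh_difference(advised=120, approved=115, rejected=5))
print(row_total_matches({"A": 2, "B": 1, "C": 90, "D": 7, "F": 100}))
```

A non-zero difference, or a row whose total does not match, would be the kind of finding that leads to a warning message for the supervisor.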

Table 5 Report of the Control subsystem, reconciliation AEH versus OPD

Part 1

Control attributes of entity AEH            All AEHs   AEHs older than 1 hour
• total number of orders in AEH approved    G99        G97
• total number of orders in AEH rejected    H99        H97
• total number of orders in AEH advised     I99        I97

Part 2

Control registration AEHs                   All AEHs   AEHs older than 1 hour
Total today's registered AEHs               I98        I96
• number approved                           G98        G96
• number rejected                           H98        H96

Part 3

Column A  OPDs registered today and not yet processed by Transaction processing
Column B  OPDs registered today and rejected today by Transaction processing
Column C  OPDs registered today and processed by Transaction processing:
          approved and OVB
Column D  OPDs registered today and processed by Transaction processing:
          approved and GOB
Column E  OPDs registered today and processed by Transaction processing:
          approved and processed (no OVB or GOB)
Column F  Totals of today's registered OPDs in different stages of processing

Transaction type                         A     B     C     D     E     F (totals)
Non-resident payment MT 100              A1    B1    C1    D1          F1 = A1 + B1 + C1 + D1
Non-resident payment MT 202              A2    B2    C2    D2    Null  F2 = A2 + B2 + C2 + D2
Non-resident payment MT 205              A3    B3    C3    D3    Null  F3 = A3 + B3 + C3 + D3
FA payment MT 100                        A4    B4    C4    D4    Null  F4 = A4 + B4 + C4 + D4
FA payment MT 202                        A5    B5    C5    D5          F5 = A5 + B5 + C5 + D5
Withdrawal GOB                           A6    B6    C6    D6          F6 = A6 + B6 + C6 + D6
Approval batch                           A7                      E7    F7 = A7 + E7
Transaction pointing                     A8    B8                E8    F8 = A8 + B8 + E8
Change of space of reserved cover        A9    B9          Null  E9    F9 = A9 + B9 + E9
Order without check of sufficient cover  A10   B10               E10   F10 = A10 + B10 + E10
Transaction pointing by the supervisor   A11   Null  Null  Null  E11   F11 = A11 + E11
Etc.                                     An    Bn    Cn    Dn    En    Fn

Total column A = A1 + A2 + A3 + A4 + A5 + A6 + A7 + A8 + A9 + A10 + A11 + An...
Total column B = B1 + B2 + B3 + B4 + B5 + B6 + B8 + B9 + B10 + Bn...
Total column C = C1 + C2 + C3 + C4 + C5 + C6 + Cn...
Total column D = D1 + D2 + D3 + D4 + D5 + D6 + Dn...
Total column E = E7 + E8 + E9 + E10 + E11 + En...
Total general  = rows F1 to Fn = totals of columns A to E



5. CONTROL TOTALS (II) - OPD VERSUS OVB/GOB



5.1 Goal 

In this part the control totals between the order (OPD) and the financial transfer
(OVB) or future financial transfer (GOB) are specified in detail. The goal of this
part is to check that:

• the number of today’s registered OPDs is in accordance with the number of
  financial transfers (OVB) or future financial transfers (GOB);
• the OVBs and GOBs, subdivided into different stages of processing, reconcile
  with the calculated control totals for both numbers and amounts;
• the withdrawn and cancelled financial transfers are only the result of specific
  present and registered orders.

The above-mentioned control totals are reproduced in the diagram presented in
Table 6. An order can be processed in different ways. The way an order is handled
depends on different criteria. In the first column the different ways of handling an
order are presented.

5.2 Description of the control total 

After an incoming message has been accepted by TOP and transformed into an 
order, the processing continues by transforming the order into a financial transfer 
(OVB) with a value date of today or into a future financial transfer (GOB) with a 
value date which lies in the future. 

On every new working day when TOP starts, the value date of each future financial
transfer is checked. When the value date of the future financial transfer is
the same as the actual value date it is transformed into a financial transfer (OVB).
If the value date of the future financial transfer is later than the actual value date it
remains a GOB. It is also possible that a GOB is rejected by TOP, for example
because the information in the GOB conflicts with master data. The state of
processing of the GOBs is presented in columns J to P in Figure 4.

The financial transfers (OVB) are processed further in that the check of sufficient 
cover is executed for the account to be debited. If there is enough cover the 
financial transfer is settled. If the cover is not sufficient the transfer is placed in a 
queue. The moment the cover is changed and is sufficient to settle the first transfer 
in the queue, settlement will take place. It is also possible that the cover is blocked 
and the final settlement will take place at a later moment. The state of processing of 
the OVBs is presented in the columns Q to R. The report generated by the Control 
subsystem contains more columns and rows, in fact, but has been shortened for this 
paper. 
The control totals are calculated for the number of transactions as well as the totals
of the amounts of the transactions concerned. It is checked whether:

- the totals of the numbers and amounts of OVBs and GOBs appear only in the
  rows and columns where they can logically be present;
- the totals for both the numbers and the amounts on the axes of the matrix
  (column ZZ and the totals at the bottom of the columns) correspond.
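
Both checks on the matrix can be sketched in Python. The sketch is an illustration under assumptions, not the report logic of TOP: the data layout (a dictionary keyed by row and column), the function name `check_matrix` and the example labels are hypothetical.

```python
def check_matrix(cells, row_totals, column_totals, allowed):
    """Return a list of findings; an empty list means the matrix reconciles.

    cells          {(row, column): value}
    row_totals     {row: total reported on the right-hand axis}
    column_totals  {column: total reported at the bottom}
    allowed        set of (row, column) positions where a value is logical
    """
    findings = []
    # Check 1: values may only appear where they are logically possible.
    for position, value in cells.items():
        if value != 0 and position not in allowed:
            findings.append(f"value in illogical cell {position}")
    # Check 2: the totals on both axes must agree with the cells and each other.
    for row, reported in row_totals.items():
        computed = sum(v for (r, _), v in cells.items() if r == row)
        if computed != reported:
            findings.append(f"row {row}: cells give {computed}, report shows {reported}")
    for column, reported in column_totals.items():
        computed = sum(v for (_, c), v in cells.items() if c == column)
        if computed != reported:
            findings.append(f"column {column}: cells give {computed}, report shows {reported}")
    if sum(row_totals.values()) != sum(column_totals.values()):
        findings.append("grand totals on the two axes differ")
    return findings

# Hypothetical two-row example.
cells = {("r1", "J"): 2, ("r1", "Q"): 3, ("r2", "Q"): 1}
print(check_matrix(cells, {"r1": 5, "r2": 1}, {"J": 2, "Q": 4}, set(cells)))
```

The same function can be run once on the numbers and once on the amounts, since the text states that both are checked.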
6. CONTROL TOTALS (III) - ACCOUNT VERSUS ORDER

6.1 Goal 

These control totals are subdivided into a limited version and an extensive version. 
Both versions check the same data. The limited version is executed during the 
working day, the extensive version is executed at the end of the working day. The 
controls in the extensive version are more detailed. 

The goal of these controls is to verify that the amounts in the financial transfers as
registered in the TOP database are in accordance with the process data concerning
the accounts. Partly for performance reasons a few fields hold data which is
often used by TOP. The data in these fields is kept up to date so that TOP does not
have to calculate it again and again.



6.2 Description of the control totals 

The report which is generated by the Control subsystem is presented in Table 7. 

Table 7 Report of the Control subsystem, reconciliation account versus order

                                        Contents of        Contents of financial
                                        account (REK)      transfers (OVB)
                                        control totals     control totals
Sum opening balance                     Amount
Sum settled debit                       Amount             Amount
Sum settled credit                      Amount             Amount
Sum balances                            Amount
Sum queue (number)                      Number             Number
Sum queue debit (today)                 Amount             Amount
Sum queue credit (today)                Amount             Amount
Sum queue debit (future)                Amount             Amount
Sum processed in cover debit (today)    Amount             Amount
Sum processed in cover credit (today)   Amount             Amount
Sum processed in cover debit (future)   Amount             Amount
Sum of credit facility                  Amount
Sum cover                               Amount

The first part of the report contains two controls. The first is the control to
reconcile the opening balances, the settled amounts debit and credit and the closing
balances: the settled debit and credit amounts together must always account for the
difference between the totals of the opening and closing balances. In the second
control the settled amounts debit and credit are compared with the information
registered for the financial transfers (OVB).

The second part of the report reconciles the number of transactions which are 
queued and the sum of the amounts to be debited or credited today and the amounts 
to be debited in the future. The information registered in the database for the 
accounts (REK) is reconciled with the information registered in the database for the 
financial transfers (OVB). In Table 7, the amounts presented are the total for all the 
accounts and financial transfers when the limited version is executed. The report 
generated when the extensive version is executed gives the same information but 
then the totals are presented for all the individual accounts. 
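
The reconciliation between account data (REK) and financial transfers (OVB) can be sketched as below. This is a hedged illustration only: the row names are invented labels for the rows of Table 7 that carry a value on both sides, and the sign convention in `balance_gap` (closing balance = opening balance + credits - debits) is an assumption, not something the paper states.

```python
from decimal import Decimal

# Rows of Table 7 that have a value on both the REK and the OVB side
# (the key names are illustrative).
COMPARED_ROWS = [
    "settled_debit", "settled_credit", "queue_number",
    "queue_debit_today", "queue_credit_today", "queue_debit_future",
    "cover_debit_today", "cover_credit_today", "cover_debit_future",
]

def compare_totals(rek, ovb):
    """Return {row: (account value, transfer value)} for every row that differs."""
    return {row: (rek[row], ovb[row]) for row in COMPARED_ROWS if rek[row] != ovb[row]}

def balance_gap(opening, settled_debit, settled_credit, closing):
    """Assumed convention: closing = opening + credits - debits.

    A result of zero means the balances reconcile.
    """
    return opening + settled_credit - settled_debit - closing

# Hypothetical totals over all accounts (limited version).
rek = {row: Decimal("10") for row in COMPARED_ROWS}
ovb = dict(rek)
print(compare_totals(rek, ovb))
print(balance_gap(Decimal("100"), Decimal("30"), Decimal("50"), Decimal("120")))
```

For the extensive version the same comparison would simply be repeated per individual account instead of once over the totals.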



7. CONTROL TOTALS (IV) - NUMBER OF OUTPUT MESSAGES
VERSUS NUMBER OF PROCESSED OPDS

7.1 Goal 

The goal of this control is to verify, during the last run of the Control subsystem,
that for a number of types of output messages the number sent is equal
to the number in the database. This control is executed only at the end of the day.

7.2 Description of the control total 

Table 8 presents the report which is generated. In fact, Table 8 shows only a part
of the report which is generated by the Control subsystem. In the report, the
reconciliation is presented between received transactions and the output messages
which are the result of the processing. The first part presents the reconciliation
between the number of settled transactions and the number of debit and credit
advices sent. All three must be identical. The number of settled transactions should
correspond with the number of settled transactions presented in Table 5.

In the second part the reconciliation is presented for the non-processed
transactions. Non-processed means that the transactions have been cancelled or
rejected, i.e. no settlement has taken place. For future transfers, the number of
received transactions is presented. The received transactions are subdivided into
those received via SWIFT, MQM or TES and those created by TOP. The number of
transactions received via SWIFT, MQM or TES must correspond with the number
of output messages of type 1518, and the number of transactions created by TOP
with the number of output messages of type 1535. The
number of received future transfers must correspond with the number presented in
N2 in Table 6.

Table 8 Report of the Control subsystem, number of output messages versus
number of transactions processed

Processed transactions
Settled transactions (Z+AA)              Number
Debit advices sent                       Number
Credit advices sent                      Number

Non-processed transactions
Future transfers
Received (N2)                            Number
Via SWIFT, MQM, TES                      Number
Message type 1518                        Number
Created by TOP                           Number
Message type 1535                        Number
Cancelled (K2+L2+O2+P2)                  Number
Message types 1511, 1516, 1525           Number
Of which by message type 1006 (E6)       Number
Message type 1516                        Number
Of which others                          Number
Message types 1511, 1525                 Number
Rejected (B4+B5+B17)                     Number
Message type 1509                        Number
Non-resident payments (B1+B2+B3)         Number
Message type 1528                        Number
Withdrawals (B6)                         Number
Message type 1508                        Number

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8. CONTROL TOTALS (V) - NUMBER OF GOBS TODAY VERSUS 
PREVIOUS DAY 

8.1 Goal 

The goal of this control is to reconcile the number and amount of future transfers
which were present at the end of the previous working day and the number and
amount of future transfers present at the beginning of the next working day.

8.2 Description of the control total 

The control is executed by selecting the future transfers in the actual database 
which were present or should have been present at the end of the previous working 
day. On the other hand, the future transfers which will be present at the beginning 
of the next working day are selected during the last run of the Control subsystem. 
The result of the control is presented in Table 9. 

Table 9 Control report by the Control subsystem on overnight reconciliation

             Old      Old      New      New
Future       J2       J2       J2       Number
transfers    K2       K2       N2       Number
             L2       L2
             M2       M2
             Total    Total    Total    Number

Future       J2       Amount   J2       Amount
transfers    K2       Amount   N2       Amount
             L2       Amount
             M2       Amount
             Total    Amount   Total    Amount



This report should be checked visually by the supervisor by comparing the columns
which present the numbers and amounts of the previous working day with the
corresponding columns of the report of the previous working day.
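
The overnight comparison lends itself to a short sketch. It assumes that only the totals of numbers and amounts are compared between the old and the new situation, which is an interpretation of the report layout rather than a documented rule; the function names and all figures are hypothetical, while the labels J2 to N2 are taken from Table 9.

```python
from decimal import Decimal

def totals(rows):
    """rows: {label: (number, amount)} -> (total number, total amount)."""
    number = sum(n for n, _ in rows.values())
    amount = sum((a for _, a in rows.values()), Decimal("0"))
    return number, amount

def overnight_difference(old, new):
    """Difference (new - old) in total number and total amount of future transfers."""
    old_number, old_amount = totals(old)
    new_number, new_amount = totals(new)
    return new_number - old_number, new_amount - old_amount

# Hypothetical figures: end of previous working day versus start of today.
old = {
    "J2": (3, Decimal("300.00")),
    "K2": (1, Decimal("50.00")),
    "L2": (0, Decimal("0.00")),
    "M2": (2, Decimal("20.00")),
}
new = {"J2": (4, Decimal("320.00")), "N2": (2, Decimal("50.00"))}

print(overnight_difference(old, new))
```

A result of zero for both the number and the amount corresponds to a report in which the supervisor sees no difference.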
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9. CONTROL TOTALS (VI) - COMPLETENESS OF OUTPUT 
MESSAGES 

The goal of this control is to ensure that all output messages were really sent. This 
must be checked before the working day can be closed by the Periodical processing 
subsystem. In the report generated by the Control subsystem the number of output 
messages made by TOP is presented, as well as the number of output messages 
sent. 

Table 10 Report reconciliation of output messages sent

Output messages generated    Number
Output messages sent         Number
Difference                   Number



The supervisor has to check visually that there is no difference between these 
numbers. 

10. CONTROL LOG 

10.1 Goal 

The goal of this control is to determine that, at the moment the control run is 
executed, the actual (financial) balances have been completed in a reliable way. 

The (financial) balances comprise 

• the number of provisional orders (VOP), orders (OPD), financial transfers 
(OVB), future financial transfers (GOB), output results (URE) and output 
units (UEH); 

• the transaction amounts for the financial transfers (OVB) and future financial 
transfers (GOB). 

By ‘completed in a reliable way’ is meant that: 

• the transactions are caused only by authorized user and system functions;

• the actual balances have been reconciled with the balances at the end of the last
working day, including the transactions which were input today.

10.2 Description of the control total 

All of the primary processes that take place within the Pre-processing, Transaction- 
processing and Post-processing subsystems write the actions they execute to a 
specific control log in the form of records. Such a record comprises both the old 
situation (before the processing has taken place) and the new situation (after the 
processing has taken place). For every order (OPD), transfer (OVB) and future 
transfer (GOB) the operation is registered in the form of a so-called status change. 
The starting situation is referred to by 'status old', the end situation by 'status new'.
Because a record in the control log contains two statuses, an event can be completely
described in the form of a status change. The history can be described by putting
all status changes after each other.

The control measure means that with every control run of the Control subsystem, 
during the day as well as at the end of the day, the control log is counted from the 
beginning. Thus, based on the individual log records, the end balances are 
calculated which should be presented in the actual TOP database. These end 
balances are in total numbers; for the (future) financial transfers they are also in 
total amounts per status. Next, the calculated ending balances are reconciled with 
the end balances of the TOP database at that moment. 

A special control is the use of the control log for the control of the transactions
which are input via the TOP end station. The control log is used to check, not on an
individual basis but in totals, that only authorized actions have taken place by users
actually logged on, and that the changes in numbers and in the status of the VOPs
and OPDs are in accordance with the control log.
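
Recounting the control log from the beginning can be sketched as follows. This is a minimal illustration, not the TOP implementation: the record layout (entity, status old, status new, amount), the use of `None` for "no previous status" and the function names are assumptions; the status codes WR, SE and GO are the ones used in the report.

```python
from collections import defaultdict
from decimal import Decimal

def replay(log):
    """Recalculate end balances (numbers and amounts per entity and status)
    from status-change records of the form (entity, old, new, amount)."""
    numbers = defaultdict(int)
    amounts = defaultdict(Decimal)
    for entity, old, new, amount in log:
        if old is not None:              # the item leaves its old status
            numbers[(entity, old)] -= 1
            amounts[(entity, old)] -= amount
        if new is not None:              # and arrives in its new status
            numbers[(entity, new)] += 1
            amounts[(entity, new)] += amount
    return dict(numbers), dict(amounts)

def differences(calculated, database):
    """Return {key: (calculated, database)} wherever the two balances differ."""
    keys = set(calculated) | set(database)
    return {k: (calculated.get(k, 0), database.get(k, 0))
            for k in keys if calculated.get(k, 0) != database.get(k, 0)}

# Hypothetical log: one transfer queued and then settled, one future transfer approved.
log = [
    ("OVB", None, "WR", Decimal("100.00")),
    ("OVB", "WR", "SE", Decimal("100.00")),
    ("GOB", None, "GO", Decimal("50.00")),
]
numbers, amounts = replay(log)
db_numbers = {("OVB", "SE"): 1, ("GOB", "GO"): 1}
print(differences(numbers, db_numbers))
```

Because the log is replayed from the start on every run, any change in the database that has no matching log record shows up as a difference, which is the property the control relies on.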



Table 11 Reconciliation between Control log and ending balances in the database

Count of status   According   According    Numbers   Amounts
and advices       to          to
TA                STO         TAA, TAX     j         n
TU                STO         TAA, TUX     j         n
TI                STO         TAA, TIX     j         n
IN                STO         VOP          j         n
TF                STO                      j         n
TV                STO         VOP, TVX     j         n
TW                STO         TWX          j         n
ON                STO         VOP          j         n
OV                STO         OPD          j         n
AT                STO         OPD          j         n
IU                STO         OPD          j         n
GO                STO         GOB          j         j
WR                STO         OVB          j         j
OP                STO         OVB, GOB     j         j
VD                STO         OVB          j         j
SE                STO         OVB          j         j
Debit advices                 DBA          j         n
Credit advices                CRA          j         n
AN                STO         URE          j         n
GL                STO         URE          j         n

Explanation status

AN = rejected by the Post-processing subsystem
AT = rejected by the Transaction processing subsystem
GL = delivered
GO = approved
IN = input by the Post-processing subsystem
IU = withdrawal executed
ON = cancelled by the Post-processing subsystem
OP = (future) financial transfer cancelled
OV = received from the Pre-processing subsystem
SE = settled
TA = logged on at the TOP end station
TI = input via TES
TF = approved via TES
TU = logged off from TES
TV = cancelled via TES
TW = changed via TES
VD = processed in the cover
WR = placed in queue
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Abstract 

This paper deals with the management of data with different integrity grades, 
represented by marked data. In databases with explicit markings of damaged 
data, which integrity constraints apply depends on the markings of the refer- 
enced data. Correct data conform to the full set of integrity constraints while 
for data that is either damaged, but inessential or approximate, some integrity 
constraints can be relaxed. The main goal of this paper is to provide a founda- 
tion for marked databases extending the relational algebra. Results provided 
in this paper are preliminary, but they provide a pragmatic and reasonable 
approach without sacrificing the theoretical foundation. 



1 INTRODUCTION. 

In databases, integrity is traditionally viewed as Boolean: a database either 
has integrity or it does not. A database that satisfies formally specified in- 
tegrity constraints nonetheless diverges, sometimes seriously, from the real 
world. Many factors inhibit synchronization with the external world. Aside 
from the initial modeling problem, both the real world and the system evolve 
over time. Experience has shown that keeping up with the evolution is time- 
consuming and error-prone. Even worse, some systems evolve due to malicious 
activity. A malicious user, an offender, may install incorrect information in 
the database either for personal benefit, such as fraudulently acquiring goods 
or services, or with the intent of damaging the organization’s operation and 
fulfillment of its mission through disruption of its information systems [1]. 

Researchers have been investigating ways to address the incompleteness 
and inconsistency that occur in real systems. For example, relaxing integrity 
constraints improves concurrent access in distributed databases [2, 17, 10];
explicit management of inconsistent values extends the range of usable values
for user applications [16, 21]; extending semantics of null values improves
system flexibility [5]; explicit management of indefinite and incomplete
information gives rise to indefinite as well as maybe query answers [7, 18].

This paper deals with the management of different integrity grades repre- 
sented by marked data. We consider the scheme proposed by Ammann et al. 
[1] to maintain precise information about the integrity of data using differ- 
ent color markers. They provide sophisticated algorithms to identify integrity 
deviations, mark inconsistent data, track and contain the spread of inconsis- 
tency, manage repair, and oversee a return to normal service. 

The main goal of this work is to provide a foundation for marked databases 
that is able to capture specific features of marked data and to extend the 
relational algebra in a coherent way. 

The work is presented as follows: Section 2 introduces informally the marked 
data and gives some preliminary examples; Section 3 introduces rigorously 
marked databases; Section 4 defines formally the logic framework for marked 
databases while Section 5 provides the relational algebra for marked data 
and some of its properties; Section 6 discusses related works; and Section 7 
presents conclusions. 



2 MARKED DATA OVERVIEW 

In a database with explicit markings of damaged data, each marker represents 
a different grade of data reliability for user applications. As presented in [1], 
we introduce four different markers and associate them with different colors: 

Red [□R]: damaged data that are always not applicable; they are provided
by the offender and they are essential to a correct system behavior. Red
data are similar to null values that violate database integrity constraints;

Off-Red [◇R]: damaged data that are sometimes not applicable; they are
provided by the offender and they are not essential. Off-Red data are similar
to null values that do not violate database integrity constraints;

Off-Green [◇G]: damaged data that are sometimes applicable; they are
provided by the system administrator (the defender) and they are in some
sense equivalent to the original data. Off-Green data result in the correct
system behavior and they can be back-up values that do not violate the
database integrity constraints;

Green [□G]: undamaged data that are always applicable; specified values
for these data always satisfy the database integrity constraints.

There are two frameworks that are relevant to our problem: modal logic of 
necessity and possibility [11, 15, 20] and database theory of null values [5, 8, 
12]. In this work we attempt, using modal logic, to formalize the main feature 
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of marked data, that is: marked data are data that someone wants or must 
use. 

We employ the following example to illustrate our ideas. Suppose that in
a department of some organization it is necessary to store information about
telephone lines. Department managers want to know in which room each
telephone lies and which person can answer a telephone call. Moreover, they
want to know who is responsible for the payment of each telephone bill. The
relations in Tables 1 and 2 give the database state, where next to each value
there is a color marker.



PERSON

NAME         SURNAME        SSN
----------   ------------   ---------------
John:0 G     Freeman:□ G    234599133:0 R
Paul:o G     White:D G      324499581:0 G
Kathy:o G    Taylor:o G     325599685:0 G
Lucy:D G     Wolf:0 G       324499581:0 R
Sam:o R      Huck:o R       245569789:0 G

TELEPHONE

TELEPHONE            COMPANY
------------------   ------------
703-708-4427:0 G     AT&T:o R
703-708-4429:0 R     AT&T:0 G
703-993-1629:0 G     Sprint:0 G
703-993-1628:0 R     Sprint:0 R

Table 1 Marked Tables person and telephone



PERSON-ROOM-TELEPHONE

SSN               ROOM       TELEPHONE
---------------   --------   ------------------
234599133:0 G     301:0 G    703-993-1629:0 G
324599687:0 G     301:o G    703-708-4429:0 G
325599585:0 G     211:0 G    703-993-1628:0 R
324499581:0 R     211:0 G    703-708-4427:0 G
245569789:0 G     302:o G    703-993-1629:0 R
245569789:0 R     212:0 R    703-708-4429:0 R
245569787:0 R     303:0 G    703-993-1628:0 R

PAYMENT

SSN               TELEPHONE
---------------   ------------------
234599133:0 R     703-708-4429:0 G
234699133:0 G     703-708-4428:0 R
234699133:0 R     703-993-1629:0 G
245569789:0 G     703-993-1628:0 G

Table 2 Marked Tables person-room-telephone and pay-for



Suppose we want the list of all known persons in relation person. If we
consider all attributes in the relation, then when we display the row
<Lucy:□G, Wolf:◇G, 324499581:□R> the user is informed by the □R marker next
to the SSN code that it is not valid. That is, the value should not be used
and it should not be in the query result. If instead the user is interested
only in the first two columns, he or she can use the data. Consider the case
of selecting all tuples from relational table person-room-telephone where
"ROOM > 199 and ROOM < 300". The evaluation of the condition depends on both
the markers and the values stored in the relational table. Again, all Red
values are not valid; Off-Red values must be used even though they are
damaged. Consider the case of making the union between relational table
payment and the projection of the first and last columns of relational table
person-room-telephone. Many tuples identical in value but different in their
markers can occur, but users would like to see just one tuple, the most
useful.

Markers introduce many problems in the behavior of databases. Automatic
management of marked data is not possible in every situation, so the user
must be taken into account. A database system for marked data should
provide standard models of behavior, leaving users free to choose one of them.
Each model of behavior should define increasing visibility over damaged data.



3 FOUNDATION FOR MARKED DATABASES 

We define a relational algebra over marked data giving a value-based [3]
semantics for marked databases. We use the standard definition for relational
databases, distinguishing marked objects with a superscript $c$. In a marked
tuple we consider values explicitly marked with a marker
$C \in \{\Box G, \Diamond G, \Diamond R, \Box R\}$. Let us denote, for a given
database scheme $DB = \{R_1(A_1, A_2, \ldots, A_n), R_2(B_1, B_2, \ldots, B_m),
\ldots, R_k(C_1, C_2, \ldots, C_l)\}$, with $\mathcal{U}^c$ the set of all
marked relations over every finite subset of attributes $U_{DB}$, with
$\mathcal{U}$ the set of all corresponding relations for the same database
scheme and with $\mathcal{I}(DB)^c$, $\mathcal{I}(DB)$ database instances. For
each attribute $A_j$ we denote with $dom^c(A_j)$ its marked domain and with
$dom(A_j)$ its corresponding domain. Moreover, for a scheme
$X(A_1, A_2, \ldots, A_k)$ we denote with
$dom^c(X) = dom^c(A_1) \times dom^c(A_2) \times \ldots \times dom^c(A_k)$ the
domain of the scheme. We assume that a set of projection functions
$\Pi = \{\pi_{A_j} : dom(X) \to dom(A_j) \mid \text{for each } A_j \in X\}$ is
defined. We formalize the meaning of each marker with respect to the set of all
possible marked queries and marked domains. We consider only computable
queries as defined by Chandra [6].



Definition 1 (CH-computable) A computable query is a partial recursive
function which, given a database as input, produces as output a relation on
the domain of the database, and satisfies a consistency criterion.



We assign to a marked query the type $Q^c : \mathcal{I}(DB)^c \to
\mathcal{U}^c$. Let us denote with $\mathcal{Q}_{\mathcal{I}(DB)^c}$ the space
of all $Q^c$ on $\mathcal{I}(DB)^c$ and with $Q^c(r^c)$ the result of $Q^c$ on
$r^c$.

Queries are expressed by ordinary relational algebra expressions and for each
$Q$ (marked or not) there exists at least one expression $\mathcal{E}$ that
represents the query and such that $\mathcal{E}(r) = Q(r)$. Let us denote with
$\mathcal{S}(Q) = X_1, X_2, \ldots, X_k$ the scheme for $\mathcal{E}(r) = Q(r)$.
The codomain $im(Q)$ of $Q(r)$ is a subset of $dom(X)$ and we denote it
$dom(\mathcal{S}(Q))$.

Because a marked database should behave, as far as possible, like an ordinary
database, the query syntax does not mention the markers.


4 THE MODAL LOGIC OF COLORS 

We give the formalization of color markers using a propositional modal logic of
colors. In Appendix 1 we introduce standard syntax and semantics of the
propositional modal logic $\mathcal{L}_{MC}$ following [11]. Moreover, in
Section 4.2 we introduce a mapping between $\mathcal{L}_{MC}$ formulas provided
for marked databases and a standard three-valued propositional logic
$\mathcal{L}_{MV3}$ [20]. $\mathcal{L}_{MC}$ and $\mathcal{L}_{MV3}$
characterize relevant properties of the relational algebra for marked
relations.

We define semantics of color markers relating the membership of a value in
a marked domain with the result of a marked query. Using modal logic we
code each marked value as a quasi-atomic formula and each marked tuple as
a conjunction of quasi-atomic formulas (see Appendix 1).

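Definition 2 The meaning of each $C_j \in \{\Box G, \Diamond G, \Diamond R,
\Box R\}$, for $t^c[A_j] = t[A_j] : C_j$ of a tuple $t^c \in r^c$, is*:

$t[A_j] : \Box G_j \equiv \forall Q^c \in \mathcal{Q}_{\mathcal{I}(DB)^c}\,((A_j \in \mathcal{S}(Q^c)) \to t^c[A_j] \in dom^c(\pi_{A_j}(\mathcal{S}(Q^c))))$   (1)

$t[A_j] : \Box R_j \equiv \forall Q^c \in \mathcal{Q}_{\mathcal{I}(DB)^c}\,((A_j \in \mathcal{S}(Q^c)) \to t^c[A_j] \notin dom^c(\pi_{A_j}(\mathcal{S}(Q^c))))$   (2)

$t[A_j] : \Diamond G_j \equiv \exists Q^c \in \mathcal{Q}_{\mathcal{I}(DB)^c}\,((A_j \in \mathcal{S}(Q^c)) \land t^c[A_j] \in dom^c(\pi_{A_j}(\mathcal{S}(Q^c))))$   (3)

$t[A_j] : \Diamond R_j \equiv \exists Q^c \in \mathcal{Q}_{\mathcal{I}(DB)^c}\,((A_j \in \mathcal{S}(Q^c)) \land t^c[A_j] \notin dom^c(\pi_{A_j}(\mathcal{S}(Q^c))))$   (4)

We consider each of the above definitions a value constraint where:
$A_j \in \mathcal{S}(Q^c)$ asserts that values in $dom^c(A_j)$ are accessible
through $\mathcal{S}(Q^c)$; $t^c[A_j] \in dom^c(\pi_{A_j}(\mathcal{S}(Q^c)))$
(or $\notin$) asserts that $t^c[A_j] = t[A_j] : C_j$ belongs (or not) to
$dom^c(A_j)$. $\pi_{A_j}$ links $t^c[A_j]$ with the exact image of $Q^c$. Each
value constraint is a necessary condition that asserts when the marked value
should be in the domain of the proper attribute of the query result.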

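Definition 3 (Tuple constraint in $\mathcal{L}_{MC}$) Given $t^c \in r^c$
defined on $X(A_1, A_2, \ldots, A_k)$, with $t^c = \langle t^c[A_1], t^c[A_2],
\ldots, t^c[A_k] \rangle$, where for $t^c[A_j] = t[A_j] : C_j$ is
$t[A_j] \in dom(A_j)$ and $C_j \in \{\Box G, \Diamond G, \Diamond R, \Box R\}$,
the tuple constraint for $t^c$ is in $\mathcal{L}_{MC}$ the following
conjunction of quasi-atomic formulas:

$\psi^c(X, t^c) = \bigwedge_{j=1 \ldots k} \phi^c(t[A_j] : C_j)$   (5)

$\phi^c(t[A_j] : \Box G) = \Box t[A_j]$        $\phi^c(t[A_j] : \Diamond G) = \Diamond t[A_j]$
$\phi^c(t[A_j] : \Box R) = \Box \neg t[A_j]$   $\phi^c(t[A_j] : \Diamond R) = \Diamond \neg t[A_j]$   (6)

$\mathcal{I}(DB)^c$ defines a set $DB = \{\psi^c(X, t^c) \mid t^c \in r^c,\
r^c \in \mathcal{I}(DB)^c,\ r^c \text{ defined on } X\}$ of tuple constraints.

*Here, logic connectives are used as shortcuts of the following words:
{$\forall$ = "for every", $\exists$ = "exists", $\land$ = "and", $\lor$ = "or",
$\to$ = "implies"}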

We consider those models $M_{DB} = (W_{DB}, R_{DB}, V_{DB})$ for the set $DB$
of tuple constraints where each set $w \subseteq 2^{\mathcal{U}^c}$ of marked
relations over every finite subset of attributes $U_{DB}$ is a possible world
$w \in W_{DB}$, the query space $\mathcal{Q}_{\mathcal{I}(DB)^c}$ defines the
accessibility relation $R_{DB}$ between possible worlds and the assignment
$V_{DB} : \mathcal{V}_{DB} \to 2^{W_{DB}}$ defines in which worlds a value
belongs to the result of a query. Under these models, Equation 1 means that
green data are those that belong to every accessible world; that is, they
should be necessarily true in all accessible worlds. Equation 2 says that red
data cannot belong to the result of a marked query, even if they are in the
database; that is, they should be necessarily false in all accessible worlds.
For Off-colors the given semantics mean that "the marked value belongs to the
result only for some of those marked queries where it should appear". Equation
3 means "it is possible that it does appear" and Equation 4 means "it is
possible that it does not appear".



4.1 Coding Marked Data into S5 

Now we show some properties of the coding of $\mathcal{I}(DB)^c$ into the
modal logic S5.

Definition 4 (Color consistency) $\mathcal{I}(DB)^c$ is color consistent iff
$DB \nvdash_{S5} \bot$.

Definition 5 (Color satisfiability) $\mathcal{I}(DB)^c$ is color satisfiable
iff $\exists M_{S5}.\ M_{S5} \models_{S5} DB$.

Proposition 1 (Color inconsistency) A set $DB$ of tuple constraints is
consistent in S5 (and more in general in KD) iff for any propositional variable
$t[A_j] \in \mathcal{V}$ none of the following three cases occurs:

(a) $\Box G\, t[A_j] \land \Diamond R\, t[A_j]$   (b) $\Diamond G\, t[A_j] \land \Box R\, t[A_j]$   (c) $\Box G\, t[A_j] \land \Box R\, t[A_j]$

Proposition 2 (Color unsatisfiability) A color consistent set $DB$ of tuple
constraints is satisfiable in S5 iff any propositional variable
$t[A_j] \in \mathcal{V}$ does not occur in $DB$ in one of the following
conflicting forms:

(a) $\Box G\, t[A_j]$ and $\Diamond R\, t[A_j]$   (b) $\Diamond G\, t[A_j]$ and $\Box R\, t[A_j]$   (c) $\Box G\, t[A_j]$ and $\Box R\, t[A_j]$   (7)

Color consistency is a necessary condition to grant that any query cannot
draw out contradictory information from a single tuple. Color satisfiability is
a necessary and sufficient condition to grant that any query cannot draw out
contradictory information from all values in a marked database.

Proposition 3 (Model existence) The model $M^* = (\mathcal{F}^*, V^*)$, where
$\mathcal{F}^*$ is shown in Figure 1 and $V^*$ is

$V^*(t[A_j]) = \{T, X\}$  if $\Box\, t[A_j]$ occurs in $DB$
$V^*(t[A_j]) = \{T\}$     if $\Diamond\, t[A_j]$ occurs in $DB$

is an S5 model for any color satisfiable set $DB$ of tuple constraints.

Figure 1 The Frame $\mathcal{F}^*$



It is not necessary to use the standard proof theory of S5 to verify color
satisfiability and the problem is solvable in at most polynomial time bounded
by the number of different values in the active domain of a marked database.
In fact (see [22]), for $\mathcal{L}_{MC}$ the following holds:

Proposition 4 There is an algorithm that, given a finite Kripke model
$M = (W, R, V)$, a world $w \in W$ and a formula $\mathcal{F} \in
\mathcal{L}_{MC}$, determines whether $M \models_w \mathcal{F}$ in time
$O(|M| \times |\mathcal{F}|)$.
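
A checker of the kind Proposition 4 refers to can be sketched by labelling, bottom-up, each subformula with the set of worlds in which it holds. The sketch below is only an illustration of that general technique, not the algorithm of [22]: the tuple encoding of formulas, the function name `truth_set` and the two-world sample model are assumptions made for the example.

```python
def truth_set(model, formula, cache=None):
    """Return the set of worlds of `model` in which `formula` holds.

    model   = (worlds, access, valuation)
    worlds  : set of world names
    access  : dict mapping a world to the set of worlds it can reach
    valuation: dict mapping a variable name to the worlds where it is true
    formula : nested tuples, e.g. ("box", ("var", "a")) or ("not", f)
    """
    worlds, access, valuation = model
    if cache is None:
        cache = {}
    if formula in cache:  # each distinct subformula is labelled only once
        return cache[formula]
    op = formula[0]
    if op == "var":
        result = set(valuation.get(formula[1], ())) & worlds
    elif op == "not":
        result = worlds - truth_set(model, formula[1], cache)
    elif op == "and":
        result = truth_set(model, formula[1], cache) & truth_set(model, formula[2], cache)
    elif op == "box":  # true where every reachable world satisfies the argument
        inner = truth_set(model, formula[1], cache)
        result = {w for w in worlds if access.get(w, set()) <= inner}
    elif op == "dia":  # true where some reachable world satisfies the argument
        inner = truth_set(model, formula[1], cache)
        result = {w for w in worlds if access.get(w, set()) & inner}
    else:
        raise ValueError(f"unknown operator: {op!r}")
    cache[formula] = result
    return result


# Hypothetical S5 model: two worlds, each reachable from the other and itself.
worlds = {"w1", "w2"}
model = (worlds, {w: set(worlds) for w in worlds}, {"a": {"w1", "w2"}, "b": {"w1"}})

green_like = truth_set(model, ("box", ("var", "a")))      # a holds everywhere
off_green_like = truth_set(model, ("dia", ("var", "b")))  # b holds somewhere
```

In this sample model `a` behaves like a Green value (necessarily true) while `b` behaves like an Off-Green one (possibly true, but not necessarily).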

Although we consider the modal logic S5 as the right framework where 
we can develop a complete theory of marked databases, including marked 
integrity constraints and functional dependencies, we use a mapping of S5 
into a propositional three-valued logic to show, in an effective way, how we
verify query consistency and how we evaluate marked queries. 



4.2 Coding Formulas over Marked Data into $\mathcal{L}_{MV3}$

We use the propositional three-valued logic $\mathcal{L}_{MV3}$ proposed by
Łukasiewicz [20] to define an effective interpretation of logical expressions
over marked values. Because $\mathcal{L}_{MV3}$ is a particular case of
propositional multi-valued logic $\mathcal{L}_{MVn}$, in Appendix 2 we
introduce basic definitions of $\mathcal{L}_{MVn}$ following [4, 14].

For our purpose $\mathcal{L}_{MV3}$ is the three-valued logic where
$\mathcal{F}$ is defined over the ordered set $\mathcal{N}_3 = \{T, I, \bot\}$
and the truth tables given in Table 3.

 A | ¬A        ∧ | T  I  ⊥        ∨ | T  I  ⊥
 T | ⊥         T | T  I  ⊥        T | T  T  T
 I | I         I | I  I  ⊥        I | T  I  I
 ⊥ | T         ⊥ | ⊥  ⊥  ⊥        ⊥ | T  I  ⊥

 implies (→) | T  I  ⊥        equivalence (≡) | T  I  ⊥
 T           | T  I  ⊥        T               | T  I  ⊥
 I           | T  T  I        I               | I  T  I
 ⊥           | T  T  T        ⊥               | ⊥  I  T

Table 3 The multi-valued logic operators for $\mathcal{L}_{MV3}$

Following [20] we consider $\Box$ and $\Diamond$ defined in
$\mathcal{L}_{MV3}$ respectively as:

$\Diamond A \equiv \neg A \to A \qquad \Box A \equiv \neg \Diamond \neg A$   (8)



Given a marked value $d : C$ with $C \in \{\Box G, \Diamond G, \Diamond R,
\Box R\}$, we assume that $d$ belongs to a lattice $\Lambda = (\mathcal{D},
\cup, \cap)$ of values and therefore there exists an order relation
$\preceq_{\mathcal{D}}$ on $\mathcal{D}$. An order relation is reflexive,
transitive and anti-symmetric and is defined as: $\forall a, b \in
\mathcal{D},\ a \preceq_{\mathcal{D}} b \equiv a \cap b = a$. We consider the
complementary relation $\not\preceq_{\mathcal{D}} = \{(a, b) \in \mathcal{D}
\times \mathcal{D} \mid \neg(a \preceq_{\mathcal{D}} b)\}$.

Let us consider for each value $a \in \mathcal{D}$ a propositional variable
$A \in \mathcal{V}$. Given a valuation $V$ of $\mathcal{L}_{MV3}$ we read:
$V(A) = T$ as the value $a$ is always reliable; $V(A) = I$ as the value $a$ is
sometimes reliable; $V(A) = \bot$ as the value $a$ is never reliable. We define
four unary operators $\{\Box G, \Diamond G, \Diamond R, \Box R\}$ to represent
in $\mathcal{L}_{MV3}$ the characteristic function of sets of values marked
homogeneously. Each operator is defined using Equation 8 and is shown in
Table 4. In Figure 2 we show the set diagram for marked values.



 X | □G X | □R X | ◇G X | ◇R X
 T |  T   |  ⊥   |  T   |  ⊥
 I |  ⊥   |  ⊥   |  T   |  T
 ⊥ |  ⊥   |  T   |  ⊥   |  T

Table 4 The multi-valued interpretation of color operators in $\mathcal{L}_{MV3}$
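
The entries of Table 4 can be recomputed mechanically. The sketch below assumes the standard Łukasiewicz encoding of the three truth values as 1, 1/2 and 0, the modal definitions $\Diamond A = \neg A \to A$ and $\Box A = \neg\Diamond\neg A$, and the reading of the red operators as the green ones applied to the negated value; all Python names are illustrative.

```python
from fractions import Fraction

# Truth values of the three-valued logic: T (always), I (sometimes), ⊥ (never).
TOP, MID, BOTTOM = Fraction(1), Fraction(1, 2), Fraction(0)
NAMES = {TOP: "T", MID: "I", BOTTOM: "⊥"}


def neg(a):
    return 1 - a


def implies(a, b):
    # Łukasiewicz implication: min(1, 1 - a + b)
    return min(Fraction(1), 1 - a + b)


def possibly(a):
    # ◇A = ¬A → A
    return implies(neg(a), a)


def necessarily(a):
    # □A = ¬◇¬A
    return neg(possibly(neg(a)))


COLOR_OPERATORS = {
    "□G": necessarily,
    "□R": lambda a: necessarily(neg(a)),
    "◇G": possibly,
    "◇R": lambda a: possibly(neg(a)),
}

# One row per marker, columns in the order T, I, ⊥.
table4 = {
    marker: [NAMES[op(value)] for value in (TOP, MID, BOTTOM)]
    for marker, op in COLOR_OPERATORS.items()
}
```

Under these assumptions `table4` reproduces the four columns of Table 4; for instance the Off-Green operator maps both T and I to T, which is what makes a "sometimes reliable" value usable.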



For $op^c \in \{\Box G, \Diamond G, \Diamond R, \Box R\}$ and any propositional
variable $t[A_j] \in \mathcal{V}$ it holds in $\mathcal{L}_{MV3}$
$\models (op^c\, t[A_j], \{T\})$ if and only if we consider models where
$t[A_j]$ belongs to the set of data marked with the marker $C_j = op^c$.

While in $\mathcal{L}_{MC}$ each tuple constraint is a conjunction of
quasi-atomic formulas, in $\mathcal{L}_{MV3}$ we define tuple constraints as a
subset of marked formulas.

Figure 2 The set diagram for marked values



Definition 6 (Marked formulas) Given a domain $\mathcal{D}$ of values (at least
the relations $\{<, \le, =, \ge, >\}_{\mathcal{D}}$ are defined in
$\mathcal{D}$), a set $\mathcal{V}$ of propositional variables (at least one
for each value in $\mathcal{D}$ is present in $\mathcal{V}$), the set of basic
operators $\mathcal{B} = \{\Box G, \Diamond G, \Diamond R, \Box R\} \cup
\{\land, \lor, \to, \equiv, \neg\}$ and the set of operators
$\mathcal{R} = \{<, \le, =, \ge, >\}$, marked formulas are:

Marked Atoms: for any propositional variable $A \in \mathcal{V}$ the marked
atoms $\Box G A$, $\Box R A$, $\Diamond G A$, $\Diamond R A$ are marked
formulas;

Relational Atoms: if $A$, $B$ are marked atoms and $\Theta \in \{<, \le, =,
\ge, >\}$ then $A \Theta B$ is a marked formula;

Marked Formulas: if $F$, $G$ are marked formulas then $\neg F$, $F \land G$,
$F \lor G$, $F \equiv G$, $F \to G$, $(F)$ are marked formulas;

nothing else is a marked formula.

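Definition 7 (Relational operators) Given two marked atoms
$A^c = op^c_A A$, $B^c = op^c_B B$, where $op^c_A, op^c_B \in \{\Box G,
\Diamond G, \Diamond R, \Box R\}$ and $A, B \in \mathcal{V}$ correspond to
$a, b \in \mathcal{D}$, for any valuation $V : \mathcal{V} \to \{T, I, \bot\}$
and any $\Theta \in \{<, \le, =, \ge, >\}$ the valuation $(A^c \Theta B^c)^V$
is: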

" ^ {T}) .h.„ ± 

.1.. if A‘" A 8“^ = A' th.n ca>. of A‘^ A B'^ 

[V(A) = V(B) = )] . {A'^QB’^)'^ = i 
[V(A) = V{B) / )] , (A^QB^)^ = (aQcs b)'^ 

• ndCaa* 

• 1 .. {A^SB^)^ = {aScs 

where $(a\, \Theta_{\mathcal{L}_B}\, b)^V$ is the standard valuation of
$\Theta$ over $\mathcal{D}$.

The expression $(a\, \bar{\Theta}_{\mathcal{L}_B}\, b)^V$ should be substituted
by $\neg (a\, \Theta_{\mathcal{L}_B}\, b)^V$ if $\bar{\Theta}$ is not defined
for all $d \in \mathcal{D}$. The symbol $\bar{\Theta}_{\mathcal{L}_B}$
represents the complement of $\Theta_{\mathcal{L}_B}$. We summarize
Definition 7 in Table 5. We give notions of satisfiability and validity for
marked formulas extending those for $\mathcal{L}_{MVn}$.

Definition 8 (Color satisfiability) Given a marked formula $F$, a set
$U \subseteq \{T, I, \bot\}$ and a valuation $V$ over propositional variables
of $F$, $(F, U)$ is satisfiable (color satisfiable), denoted as
$\models^V_c (F, U)$, iff $(F)^V \in U$.
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□ G 


oG 


oR 


□ R 


□ G 


©£5 


0£b 


± 


i. 


oG 


“•©^B 


^Cb 


1 


± 


oR 


1 


1 


^Cb 


“>©/:s 


□ R 




± 


^Cb 


©£b 



Table 5 The evaluation of $\Theta_{\mathcal{L}_{MV3}} \in \{<, \le, =, \ge, >\}$ on marked data



Definition 9 (Color validity) Given a marked formula $F$ and a set
$U \subseteq \{T, I, \bot\}$, $(F, U)$ is valid (color valid), denoted as
$\models_c (F, U)$, iff $\forall V,\ \models^V_c (F, U)$.

Note that correspondence between S5 and $\mathcal{L}_{MV3}$ holds only
considering satisfiability and validity when $U = \{T\}$ (see [20]). Given a
marked formula $F$, because the interpretation $(F)^V$ returns a value
$v \in \{T, I, \bot\}$ we can define at least two different further notions of
satisfiability and validity as follows:

Damaged Data Intolerant interpretation $(F)^V = T$;

Damaged Data Tolerant interpretation $(F)^V \in \{T, I\}$.

We denote with $\models^V_{\downarrow} F$, $\models_{\downarrow} F$ and
$\models^V_{\uparrow} F$, $\models_{\uparrow} F$ respectively satisfiability
and validity in damaged data intolerant and damaged data tolerant
interpretations.

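Let us consider a lattice $\Lambda = (\mathcal{D}, \cup, \cap)$, the order
relation $\preceq_{\mathcal{D}} = \{(a, b) \in \mathcal{D} \times \mathcal{D}
\mid a \cap b = a\}$ and the coding of marked values into marked formulas.

Proposition 5 Given $a^c, b^c \in \mathcal{D}^c$, $a^c = a : C_a$,
$b^c = b : C_b$ where $a, b \in \mathcal{D}$ and $C_a, C_b \in \{\Box G,
\Diamond G, \Diamond R, \Box R\}$, the relation $\preceq_{\mathcal{D}^c}$
defined as:

$\preceq_{\mathcal{D}^c} = \{(a^c, b^c) \in \mathcal{D}^c \times \mathcal{D}^c \mid\ \models_c ((A^c \le B^c), \{T\})\}$   (9)

is an order relation over the set of marked values.

Table 6 shows that $\preceq_{\mathcal{D}^c}$ is an order relation built as the
disjoint union of two different order relations.

Definition 10 (Tuple constraint in $\mathcal{L}_{MV3}$) Given $t^c \in r^c$
defined on $X(A_1, A_2, \ldots, A_k)$, with $t^c = \langle t^c[A_1], t^c[A_2],
\ldots, t^c[A_k] \rangle$, where for each $t^c[A_j] = t[A_j] : C_j$ it holds
$t[A_j] \in dom(A_j)$ and $C_j \in \{\Box G, \Diamond G, \Diamond R, \Box R\}$,
the tuple constraint for $t^c$ is in $\mathcal{L}_{MV3}$ the following
conjunction of marked atoms:

$\Gamma^c(X, t^c) = \bigwedge_{j=1 \ldots k} \phi^c(t[A_j] : C_j)$   (10)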
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□ G 


oG 


oR 


□ R 


□ G 


div 




X 


X 


oG 






1 


X 


oR 


X 


1 




— IX-p 


□ R 


X 


X 


dlTt 


div 



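Table 6 The partial order relation on marked data

$\phi^c(t[A_j] : \Box G) = \Box G\, t[A_j]$   $\phi^c(t[A_j] : \Diamond G) = \Diamond G\, t[A_j]$
$\phi^c(t[A_j] : \Box R) = \Box R\, t[A_j]$   $\phi^c(t[A_j] : \Diamond R) = \Diamond R\, t[A_j]$   (11)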

Note that given defined on ^(^i, A 2 , • • -4*) it holds: 

1=1^ r{X,r^) = Mss Ns5 r(X,n and |=f ^ Mss Nss r(X,t<=). 



4.3 Evaluating Query Consistency in $\mathcal{L}_{MV3}$

Let us consider $t^c \in r^c$ defined on $X(A_1, A_2, \ldots, A_k)$. Suppose
that we want to know if $t^c$ is in the result of a specific $Q^c$ such that
$Z = (\mathcal{S}(Q^c) \cap X) \neq \emptyset$ where $Z = \{A_{i_1}, A_{i_2},
\ldots, A_{i_m}\}$ and $Z \subseteq X$. In general for $t^c$ it holds
$\psi^c(X, t^c) \land \sigma_{Q^c}(t^c)$, where $\psi^c(X, t^c)$ is defined in
Equation 5 and $\sigma_{Q^c}(t^c)$ depends on $Q^c$ evaluated on $t^c$.
$\psi^c(X, t^c) \land \sigma_{Q^c}(t^c)$ represents a query consistency
constraint, that is, the conjunction of two necessary constraints: the former,
$\psi^c(X, t^c) = (\bigwedge_{j=1 \ldots k} \phi^c(t[A_j] : C_j))$, takes
account of color consistency and the latter, $\sigma_{Q^c}(t^c)$, takes account
of the implicitly defined constraint that verifies for each query the existence
of values to display.

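Definition 11 Given $t^c \in r^c$ defined on $X(A_1, A_2, \ldots, A_k)$, with
$t^c = \langle t^c[A_1], t^c[A_2], \ldots, t^c[A_k] \rangle$, where for each
$t^c[A_j] = t[A_j] : C_j$ is $t[A_j] \in dom(A_j)$ and $C_j \in \{\Box G,
\Diamond G, \Diamond R, \Box R\}$, and $Q^c$ such that
$Z = (\mathcal{S}(Q^c) \cap X) \neq \emptyset$ where $Z = \{A_{i_1}, A_{i_2},
\ldots, A_{i_m}\}$, the constraint $\sigma_{Q^c}(t^c)$ is:

$\sigma_{Q^c}(t^c) = (\bigwedge_{s=1 \ldots m} t[A_{i_s}])$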

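Definition 12 Given $t^c \in r^c$ defined on $X(A_1, A_2, \ldots, A_k)$, with
$t^c = \langle t^c[A_1], t^c[A_2], \ldots, t^c[A_k] \rangle$, and $Q^c$ such
that $Z = (\mathcal{S}(Q^c) \cap X) \neq \emptyset$, $Z = \{A_{i_1}, A_{i_2},
\ldots, A_{i_m}\}$ and $X \setminus \mathcal{S}(Q^c) = \{A_{l_1}, A_{l_2},
\ldots, A_{l_{k-m}}\}$, the query consistency constraint is in
$\mathcal{L}_{MV3}$:

$\Gamma^c(t^c[Z])^{\star} \land \Gamma^c(t^c[X \setminus Z])$   (12)

$\Gamma^c(t^c[Z])^{\star} = \bigwedge_{s=1 \ldots m} \Upsilon(t[A_{i_s}] : C_{i_s})$   (13)

$\Gamma^c(t^c[X \setminus Z]) = \bigwedge_{s=1 \ldots |m-k|} \phi^c(t[A_{l_s}] : C_{l_s})$   (14)

$\phi^c(t[A_{l_s}] : \Box G) = \Box G\, t[A_{l_s}]$   $\phi^c(t[A_{l_s}] : \Diamond G) = \Diamond G\, t[A_{l_s}]$
$\phi^c(t[A_{l_s}] : \Box R) = \Box R\, t[A_{l_s}]$   $\phi^c(t[A_{l_s}] : \Diamond R) = \Diamond R\, t[A_{l_s}]$   (15)

$\Upsilon(t[A_{i_s}] : \Box G) = \Box G\, t[A_{i_s}] \land t[A_{i_s}] = \Box G\, t[A_{i_s}]$
$\Upsilon(t[A_{i_s}] : \Box R) = \Box R\, t[A_{i_s}] \land t[A_{i_s}] = \bot$
$\Upsilon(t[A_{i_s}] : \Diamond G) = \Diamond G\, t[A_{i_s}] \land t[A_{i_s}] = t[A_{i_s}]$
$\Upsilon(t[A_{i_s}] : \Diamond R) = \Diamond R\, t[A_{i_s}] \land t[A_{i_s}] = \neg t[A_{i_s}] \land t[A_{i_s}]$   (16)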
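
The simplifications in Equation 16 ($\Box G\,t \land t = \Box G\,t$, $\Box R\,t \land t = \bot$, $\Diamond G\,t \land t = t$, $\Diamond R\,t \land t = \neg t \land t$) can be checked by enumerating the three truth values. The sketch assumes the Łukasiewicz encoding 1, 1/2, 0 with conjunction as minimum, and the color operators built from $\Diamond A = \neg A \to A$ and $\Box A = \neg\Diamond\neg A$; the Python names are illustrative.

```python
from fractions import Fraction

TOP, MID, BOTTOM = Fraction(1), Fraction(1, 2), Fraction(0)
VALUES = (TOP, MID, BOTTOM)


def neg(a):
    return 1 - a


def conj(a, b):
    return min(a, b)  # three-valued conjunction


def possibly(a):
    return min(Fraction(1), 1 - neg(a) + a)  # ◇A = ¬A → A


def necessarily(a):
    return neg(possibly(neg(a)))  # □A = ¬◇¬A


MARKERS = {
    "□G": necessarily,
    "□R": lambda a: necessarily(neg(a)),
    "◇G": possibly,
    "◇R": lambda a: possibly(neg(a)),
}

# Right-hand sides of the four simplifications.
SIMPLIFIED = {
    "□G": necessarily,
    "□R": lambda a: BOTTOM,
    "◇G": lambda a: a,
    "◇R": lambda a: conj(neg(a), a),
}

# Collect every (marker, value) pair where "marker(t) AND t" differs from its simplification.
mismatches = [
    (marker, value)
    for marker, op in MARKERS.items()
    for value in VALUES
    if conj(op(value), value) != SIMPLIFIED[marker](value)
]
```

An empty `mismatches` list confirms that, under these assumptions, each simplification holds for all three truth values; in particular a Red value can never satisfy the display constraint.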





Values in a marked tuple can be displayed by a query if at least the subset 
belonging to the query scheme is not contradictory (see the example below). 

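Definition 13 (Query consistency) $t^c$ defined on $X(A_1, A_2, \ldots, A_k)$
is query consistent in $\mathcal{L}_{MV3}$ with respect to $Q^c$ iff

Damaged Data Intolerant interpretation $\models_{\downarrow} \Gamma^c(t^c[\mathcal{S}(Q^c) \cap X])^{\star}$
Damaged Data Tolerant interpretation $\models_{\uparrow} \Gamma^c(t^c[\mathcal{S}(Q^c) \cap X])^{\star}$

We also want to manage situations where the user needs Red values. Therefore,
we introduce a constraint $\Gamma^c(t^c[\mathcal{S}(Q^c) \cap X])^{\sim}$, dual
with respect to Equation 13, and a notion of weak query consistency.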





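$\Gamma^c(t^c[\mathcal{S}(Q^c) \cap X])^{\sim} = \bigwedge_{s=1 \ldots m} \tilde{\Upsilon}(t[A_{i_s}] : C_{i_s})$

$\tilde{\Upsilon}(t[A_{i_s}] : \Box G) = \Box G\, t[A_{i_s}] \land \neg t[A_{i_s}] = \bot$
$\tilde{\Upsilon}(t[A_{i_s}] : \Box R) = \Box R\, t[A_{i_s}] \land \neg t[A_{i_s}] = \Box R\, t[A_{i_s}]$
$\tilde{\Upsilon}(t[A_{i_s}] : \Diamond G) = \Diamond G\, t[A_{i_s}] \land \neg t[A_{i_s}] = \neg t[A_{i_s}] \land t[A_{i_s}]$
$\tilde{\Upsilon}(t[A_{i_s}] : \Diamond R) = \Diamond R\, t[A_{i_s}] \land \neg t[A_{i_s}] = \neg t[A_{i_s}]$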


Definition 14 (Weak query consistency) $t^c$ defined on $X(A_1, A_2, \ldots,
A_k)$ is weakly query consistent in $\mathcal{L}_{MV3}$ with respect to $Q^c$
iff

Damaged Data Intolerant interpretation $\models_{\downarrow} \Gamma^c(t^c[\mathcal{S}(Q^c) \cap X])^{\sim}$
Damaged Data Tolerant interpretation $\models_{\uparrow} \Gamma^c(t^c[\mathcal{S}(Q^c) \cap X])^{\sim}$

5 RELATIONAL ALGEBRA FOR MARKED DATA 

The semantics given so far state, for a marked tuple, how each of its marked
values individually constrains the behavior of a marked database. From the
given semantics of color markers follow sufficient conditions to test when a
marked tuple belongs to a marked query. This test allows us to build practical
marked database systems on top of traditional systems.



5.1 Conditional Expressions 

Definition 15 (Propositional formulas over marked databases) Let $r^c$ be
defined over $X(A_1, A_2, \ldots, A_k)$; a propositional formula $\mathcal{F}$
over $X$ is:

Atoms: $A_i \Theta A_j$ or $A_i \Theta c$ are propositional formulas over $X$,
where $A_i, A_j \in X$, $c \in dom(X)$ (then $c$ is not a marked datum) and
$\Theta \in \{<, \le, =, \ge, >\}$;

Formulas: if $\mathcal{F}$, $\mathcal{G}$ are formulas on $X$ then
$\mathcal{F} \land \mathcal{G}$, $\mathcal{F} \lor \mathcal{G}$,
$(\mathcal{F})$ are formulas on $X$;

nothing else is a formula on $X$.

A propositional formula $\mathcal{F}$ over $X$ associates a value in
$\{T, I, \bot\}$ with each marked tuple. The effective evaluation is obtained
for each tuple $t^c \in r^c$ by coding $\mathcal{F}$ as a marked formula.

Definition 16 (Propositional formulas coding) Given $t^c \in r^c$ defined
on $X(A_1, A_2, \ldots, A_k)$, with $t^c = \langle t^c[A_1], t^c[A_2], \ldots,
t^c[A_k] \rangle$, where for each $t^c[A_j] = t[A_j] : C_j$ is
$t[A_j] \in dom(A_j)$ and $C_j \in \{\Box G, \Diamond G, \Diamond R, \Box R\}$;
given a propositional formula $\mathcal{F}$ over $X$, the coding function that
maps $\mathcal{F}$ into a marked formula is:

where $A_i, A_j \in X$, $c \in dom(X)$, $\phi^c(t[A_k] : C_k)$ is in Equation
11, $\Theta \in \{<, \le, =, \ge, >\}$ is in Definition 7 and $\{\land, \lor,
\equiv, \neg\}$ are in Table 3.

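We consider in $\mathcal{L}_{MV3}$ those valuations $V$ such that
$\models^V_c (\phi^c(t[A_k] : C_k), \{T\})$ and, given $U \subseteq \{T, I,
\bot\}$, we denote satisfiability of $F$ as: $t^c \models_c (F, U) \equiv\
\models^V_c (F, U)$. Given $\mathcal{F}$ defined on $X$ and its coding $F$ for
$t^c$ (more in general, any $F \in \mathcal{L}_{MV3}$), we have:

Damaged Data Intolerant interpretation $t^c \models_{\downarrow} F$ iff $\models^V_{\downarrow} F$
Damaged Data Tolerant interpretation $t^c \models_{\uparrow} F$ iff $\models^V_{\uparrow} F$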

Each of them represents a different user interpretation of the reliability of 
marked data. Moreover, for each query the user can choose between different 
notions of query consistency. 
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5.2 Algebraic Operators 

Because a single tuple $t$ defined on $R(B_1, B_2, \ldots, B_m)$ in a standard
database can be obtained from $4^m$ different marked tuples, a grouping
operator for marked tuples that differ only in their color markers is
necessary. Moreover, each marked tuple must satisfy a query consistency
constraint to belong to the result of a query and this gives rise to different
query tests $\mathcal{T}^c_{(a,l)}$.

The pre-result is a wide set of marked tuples from which, through the grouping
operation and the query test, all unnecessary and contradictory tuples are
removed. The high-quality technology used to optimize standard queries suggests
calculating the pre-result, as far as possible, using standard systems. We
consider for each algebraic operator $\{\cap, \cup, \setminus, \times, \bowtie,
\pi_X, \sigma_{\mathcal{F}}\}$ a different pre-result operator $\{\cap, \cup,
\setminus, \times, \text{pre-}\bowtie, \text{pre-}\pi_X,
\text{pre-}\sigma_{\mathcal{F}}\}$ and different query tests
$\mathcal{T}^c_{(a,l)}$ where $l \in \{\downarrow, \uparrow\}$ and
$a \in \{\star, \sim\}$ ($\star$ normal, $\sim$ weak query consistency). The
grouping operator $\mathcal{G}^c$ is unique, even if it depends on the chosen
interpretation through $\mathcal{T}^c_{(a,l)}$.

Definition 17 (The query test $\mathcal{T}^c_{(a,l)}$) Given $t^c \in r^c$
defined on $X(A_1, A_2, \ldots, A_k)$ and a set of attributes $S$ such that
$Z = (S \cap X) \neq \emptyset$, $\mathcal{T}^c_{(a,l)}(t^c, X, S)$ is defined
as shown in Table 7.



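 T^c_(a,l)   | l = ↓                    | l = ↑
 weak: ~     | ⊨↓ Γ^c(t^c[S ∩ X])^~    | ⊨↑ Γ^c(t^c[S ∩ X])^~
 normal: ★   | ⊨↓ Γ^c(t^c[S ∩ X])^★    | ⊨↑ Γ^c(t^c[S ∩ X])^★

Table 7 The Table of $\mathcal{T}^c_{(a,l)}$

Note that $\mathcal{T}^c_{(a,l)}$ is a meta-logic operator that returns values
in $\{T, \bot\}$.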

Proposition 6 Given $t^c \in r^c$ defined on $X(A_1, A_2, \ldots, A_k)$ and
given a specific $Q^c$ such that $Z = (\mathcal{S}(Q^c) \cap X) \neq
\emptyset$, $t^c$ is valid with respect to $\mathcal{T}^c_{(a,l)}$ iff $t^c$
has over attributes in $Z$ no conflicting forms (see Equation 7) and, in case
of:

$a = \star$, $l = \downarrow$: $t^c$ has all green and/or Off-green data (3 cases);
$a = \star$, $l = \uparrow$: $t^c$ has all green and/or Off-green and/or Off-red data (7 cases);
$a = \sim$, $l = \downarrow$: $t^c$ has all red and/or Off-red data (3 cases);
$a = \sim$, $l = \uparrow$: $t^c$ has all red and/or Off-red and/or Off-green data (7 cases).
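
The four cases of Proposition 6 amount to checking which markers occur on the attributes selected by the query. A minimal sketch under that reading follows; the dictionary layout of a marked tuple, the function name `query_test` and the keyword values `"normal"`/`"weak"` and `"intolerant"`/`"tolerant"` are assumptions made for the example, and each attribute is taken to carry exactly one marker, so no conflicting forms can arise.

```python
# Markers admitted on the queried attributes for each (consistency, interpretation) pair.
ALLOWED_MARKERS = {
    ("normal", "intolerant"): {"□G", "◇G"},
    ("normal", "tolerant"): {"□G", "◇G", "◇R"},
    ("weak", "intolerant"): {"□R", "◇R"},
    ("weak", "tolerant"): {"□R", "◇R", "◇G"},
}


def query_test(marked_tuple, attributes, consistency="normal", interpretation="intolerant"):
    """Return True when the tuple may appear in a query over `attributes`.

    marked_tuple maps an attribute name to a (value, marker) pair.
    """
    allowed = ALLOWED_MARKERS[(consistency, interpretation)]
    return all(marked_tuple[name][1] in allowed for name in attributes)


# The row discussed in Section 2: a green name, an Off-green surname, a red SSN.
lucy = {
    "NAME": ("Lucy", "□G"),
    "SURNAME": ("Wolf", "◇G"),
    "SSN": ("324499581", "□R"),
}

names_only = query_test(lucy, ["NAME", "SURNAME"])
with_ssn = query_test(lucy, ["NAME", "SURNAME", "SSN"])
ssn_weak = query_test(lucy, ["SSN"], consistency="weak")
```

The row passes the normal test when only the first two columns are requested, fails it as soon as the red SSN is included, and the SSN alone passes the weak test that exists for users who need Red values.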



Example 1 Let us consider relations in Table 2 and a query that projects the 
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first and the last column of person-room-telephone. In Table 11 (appearing 
in Appendix 3) we show the query result with respect to weak and normal 
query consistency. 

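Definition 18 (The grouping operator $\mathcal{G}^c$) Given $r^c$ defined over
$X(A_1, A_2, \ldots, A_k)$ such that the corresponding relation $r$ is a single
tuple $t$, the grouping operator $\mathcal{G}^c(r^c)$ is:

if $\exists t^c \in r^c$ s.t. $\forall t'^c \in r^c\ (t'^c \neq t^c \to t'^c \prec t^c)$
then $\mathcal{G}^c(r^c) = t^c$
else $\mathcal{G}^c(r^c) = t$ with all markers $C_j = \Diamond G$ $(j = 1 \ldots k)$

where $t'^c \neq t^c$ means that $t'^c$, $t^c$ differ at least for a marker and
$t'^c \prec t^c$ is:

Case of $\langle \mathcal{T}^c_{(a,l)}(t'^c, X, X),\ \mathcal{T}^c_{(a,l)}(t^c, X, X) \rangle$
$\langle T, T \rangle$: $(num(\Box G, t'^c) < num(\Box G, t^c)) \lor ((num(\Box G, t'^c) = num(\Box G, t^c)) \land (num(\Diamond G, t'^c) < num(\Diamond G, t^c)))$
$\langle T, \bot \rangle$: $\bot$
$\langle \bot, T \rangle$: $T$
$\langle \bot, \bot \rangle$: $\bot$
EndCase

where $num(C, t^c)$ counts the number of markers $C$ in $t^c$.

Note that, because of $\mathcal{T}^c_{(a,l)}$, $\mathcal{G}^c$ and $\prec$
depend on which interpretation the user adopts. The grouping of a marked
relation is obtained by applying $\mathcal{G}^c$ to each group of tuples
identical in value but different in markers. Note that for a group formed by
only one tuple $\mathcal{G}^c$ returns the tuple itself.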
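
The grouping rule can be sketched for one group of tuples that agree in value and differ only in markers. This is a reading of Definition 18 under stated assumptions, not a definitive implementation: a marked tuple is modelled as a tuple of `(value, marker)` pairs, the query test is passed in as a function, and the sample values and markers are invented for illustration.

```python
def count(marked_tuple, marker):
    """Number of attributes of the tuple carrying the given marker."""
    return sum(1 for _, m in marked_tuple if m == marker)


def precedes(first, second, test):
    """True when `first` is strictly less useful than `second`."""
    ok_first, ok_second = test(first), test(second)
    if ok_first and ok_second:
        # Compare the number of Green markers, then of Off-Green markers.
        return (count(first, "□G"), count(first, "◇G")) < (
            count(second, "□G"),
            count(second, "◇G"),
        )
    # A tuple failing the query test precedes one passing it, and nothing else.
    return (not ok_first) and ok_second


def group(marked_tuples, test):
    """Return the single most useful tuple, or the all-Off-Green fallback."""
    for candidate in marked_tuples:
        if all(precedes(other, candidate, test)
               for other in marked_tuples if other != candidate):
            return candidate
    return tuple((value, "◇G") for value, _ in marked_tuples[0])


def normal_intolerant(marked_tuple):
    """Hypothetical query test: only Green and Off-Green markers are accepted."""
    return all(marker in {"□G", "◇G"} for _, marker in marked_tuple)


candidates = [
    (("234599133", "□G"), ("703-708-4429", "◇G")),
    (("234599133", "□G"), ("703-708-4429", "□G")),
    (("234599133", "□R"), ("703-708-4429", "□G")),
]
best = group(candidates, normal_intolerant)
```

Here the second candidate wins because it passes the test and carries the most Green markers. When two candidates tie, no tuple dominates all others and the sketch falls back to the value tuple marked Off-Green throughout.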

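The set of basic pre-operators for marked relations, $\times$, $\cup$, $\cap$,
$\setminus$ and $\pi_X$, coincides with the standard operators of relational
algebra, considering marked values as atomic values. Let us introduce a
standard renaming operator $\mu_{X \to Y}(r^c)$ that renames attributes $X$ of
$r^c$ in attributes $Y$ and the operator $\Lambda_{\mathcal{F}}(r^c)$.

Definition 19 (Three-valued satisfiable set) Given $r^c$ defined over
$X(A_1, A_2, \ldots, A_k)$ and $F \in \mathcal{L}_{MV3}$, the three-valued
satisfiable set $\Lambda_F(r^c)$ is:

$\Lambda_F(r^c) = \{t^c \in r^c \mid t^c \models_l F\}$

$\Lambda_F$ is useful to evaluate $\mathcal{T}^c_{(a,l)}$. In fact, each
expression shown in Table 7 is calculated as
$\Lambda_{\mathcal{T}^c_{(a,l)}(X,Y)}$, where $X$, $Y$ are sets of attributes.

Definition 20 (Marked selection $\sigma^c_{\mathcal{F}}$) Given $r^c$ defined
over $X(A_1, A_2, \ldots, A_k)$ and $\mathcal{F}$ defined over $C \subseteq X$,
the pre-selection pre-$\sigma_{\mathcal{F}}$ is:

pre-$\sigma_{\mathcal{F}}(r^c) = \{t^c \in r^c \mid \mathcal{T}^c_{(a,l)}(t^c, X, C) \text{ and } t^c \models_l \mathcal{F}\}$

pre-$\sigma_{\mathcal{F}}(r^c) = \Lambda_{\mathcal{T}^c_{(a,l)}(X,C)}(r^c) \cap \Lambda_{(\mathcal{F})(C,X)}(r^c)$

and the marked selection is:

$\sigma^c_{\mathcal{F}}(r^c) = \mathcal{G}^c(\text{pre-}\sigma_{\mathcal{F}}(r^c)) = \mathcal{G}^c(\Lambda_{\mathcal{T}^c_{(a,l)}(X,C)}(r^c) \cap \Lambda_{(\mathcal{F})(C,X)}(r^c))$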

Example 2 (Marked selection $\sigma^c_{\mathcal{F}}$) Let us consider relation
view defined on $X$ = {SSN, BILL, LAST-BILL} and shown in Table 13.
$\sigma^c_{\mathcal{F}}$ where $\mathcal{F}$ = BILL > LAST-BILL under different
interpretations is shown in Table 12 and the corresponding query constraints
are shown in Table 14.

Definition 21 (Marked projection 7Ty) Given defined over X{Ai, A 2 , 
. . Ak) and Y C X, the marked projection 7Ty is: 

ir^y(r-) = g- (Ar^c (y,y)(iry(r-))) = {t<^[Y] \rer- and Y, Y)} 

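Definition 22 (Marked natural join $\bowtie^c$) Given $p^c$, $q^c$ defined 
respectively on $P(A_1, A_2, \ldots, A_m)$ and $Q(B_1, B_2, \ldots, B_n)$, such 
that $P \cap Q = X$, the natural pre-join pre-$\bowtie^c$ is: 

$p^c \;\text{pre-}\bowtie^c\; q^c = \pi_{P \cup Q}(\text{pre-}\sigma^c_{X=Y}((p^c \times \mu_{X \to Y}(q^c)) \cup (\mu_{X \to Y}(p^c) \times q^c)))$ 

where $P \cup Q$ is the scheme for pre-$\bowtie^c$, $Y \not\subset (P \cup Q)$, 
and $X = Y$ stands for the expression that equates attributes $X$ and $Y$. The 
marked natural join is: 

$p^c \bowtie^c q^c = g^c(\Lambda_{F^c(P \cup Q, P \cup Q)}(p^c \;\text{pre-}\bowtie^c\; q^c)) = \pi^c_{P \cup Q}(p^c \;\text{pre-}\bowtie^c\; q^c)$ 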

Note that, following Table 5, in the case of $\theta$ = "=" and the damaged data 
tolerant interpretation, the expression $X = Y$ equates attributes that are 
different in value. Therefore, the projection $\pi_{P \cup Q}$ can return 
$2^{|X|}$ different tuples for each 
$t^c \in \text{pre-}\sigma^c_{X=Y}((p^c \times \mu_{X \to Y}(q^c)) \cup (\mu_{X \to Y}(p^c) \times q^c))$ 
that has different values between the equated attributes. The union of two 
products is necessary to manage this case (see the following example). 

Example 3 (Marked natural join $\bowtie^c$) Let us consider relations PAYMENT 
and TELEPHONE of Tables 1 and 2. The pre-join pre-$\bowtie^c$ with respect to 
normal query consistency and different interpretations is shown in Tables 15 
and 16. The join $\bowtie^c$ is shown in Tables 17 and 18. 

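Definition 23 (Marked $\theta$-join $\bowtie^c_\theta$) Given $p^c$, $q^c$ defined 
respectively on $P(A_1, A_2, \ldots, A_m)$ and $Q(B_1, B_2, \ldots, B_n)$, such 
that $P \cap Q = \emptyset$, and given $\theta$ defined over 
$C \subseteq (P \cup Q)$, the $\theta$-pre-join pre-$\bowtie^c_\theta$ is: 

$p^c \;\text{pre-}\bowtie^c_\theta\; q^c = \text{pre-}\sigma^c_\theta(p^c \times q^c)$ 

where $P \cup Q$ is the scheme for pre-$\bowtie^c_\theta$, and the marked 
$\theta$-join is: 

$p^c \bowtie^c_\theta q^c = g^c(\Lambda_{F^c(P \cup Q, P \cup Q)}(p^c \;\text{pre-}\bowtie^c_\theta\; q^c)) = \pi^c_{P \cup Q}(p^c \;\text{pre-}\bowtie^c_\theta\; q^c)$ 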

Note that, following Table 5 and the damaged data tolerant interpretation, the 
expression $\theta$ holds between values that are not correctly related. 
Therefore, the projection $\pi_{P \cup Q}$ returns tuples that should never be 
considered in a standard query (see the following example). 

Example 4 (Marked $\theta$-join $\bowtie^c_\theta$) Let us consider relations 
TELEPHONE and PAYMENT of Tables 1 and 2 and dot notation as a renaming for 
attribute TELEPHONE of both tables. Consider 
$\theta$ = TELEPHONE.TELEPHONE < PAYMENT.TELEPHONE. 
The pre-join pre-$\bowtie^c_\theta$ with respect to normal query consistency and 
different interpretations is shown in Tables 19 and 20. The join 
$\bowtie^c_\theta$ is shown in Tables 21 and 22. 

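For all other algebra operators $\{\times^c, \cup^c, \cap^c, \setminus^c\}$ the 
definition is: 

Definition 24 Given $p^c$, $q^c$ defined over suitable schemes 
$P(A_1, A_2, \ldots, A_m)$ and $Q(B_1, B_2, \ldots, B_n)$, each 
$op^c \in \{\times^c, \cup^c, \cap^c, \setminus^c\}$ is: 

$p^c \, op^c \, q^c = g^c(\pi_{P \cup Q}(p^c \;\text{pre-}op^c\; q^c)) = \pi^c_{P \cup Q}(p^c \;\text{pre-}op^c\; q^c)$ 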

All operators of the algebra for marked relations are defined using the standard 
$\cup$, $\cap$, $\times$, $\setminus$, $\pi_X$ and two extra operators: 
$\Lambda_F$ and $g^c$. That is, a management system for marked relations can be 
built on top of a standard database management system augmented with 
three-valued propositional formulas. 

In [4] it is shown that satisfiability of formulas in conjunctive normal form 
for three-valued propositional logics is decidable in polynomial time, under 
the assumption of a fixed theory and considering as input the formula. 

Proposition 7 The evaluation problem for a query defined using the operators 
provided for the algebra of marked relations belongs to the same computational 
complexity class as the evaluation problem for standard relational algebra. 



6 RELATED WORK 

Two main research areas are related to the problem addressed in this work, 
namely Generalized Annotated Logic Programming and Null Values Theory. 

The theory of Generalized Annotated Logic Programming (GAP) by Kifer 
and Subrahmanian [16] has been proposed to deal with inconsistencies in 
knowledge bases. The utility of annotated logics for reasoning with inconsistency 
and for programming expert systems has also been argued [16]. The expressive 
power of GAP is very rich and it also subsumes some temporal logic programming, 
but GAP cannot be adequately implemented [16]. As remarked by Lu [19], GAP 
theory is strongly related to a first-order logic of signed formulas and to the 
labeled deductive system proposed by Gabbay [9]. All of this research involves 
studies of multi-valued, standard and modal logics [19]. Our solution aims to 
use only formulas expressed in suitable propositional logics instead of more 
expressive predicate logics. We lose expressive power but gain tractability, as 
shown in Section 5. In any case, we do not lose expressive power with respect to 
user expectations. The main difference between our proposal and one based on 
logic programming with signs and annotations is that we consider signed 
constants instead of signed formulas. A mapping of our proposal into GAP is 
possible, but it provides neither new insights into the problem nor a better 
implementation of our solution. 

The null values theory (see [3, 5]) introduces a set of explicit markers to 
manage unknown values. The meaning of markers for null values is shown 
in Table 8, where: $\bot$ denotes a non-existent value; $D$ denotes an attribute 
domain and is used to say that a value exists but is unknown; $D \cup \bot$ 
denotes an unknown value that either lies in the domain or is $\bot$; 
$V_x \cup \bot$ says that the actual value either lies in the set 
$V_x \subseteq D$ or is $\bot$; $V_x$ says that the value is in the set 
$V_x \subseteq D$; and $\{v\}$ denotes the specified value. The null 
values theory lacks one of the basic requirements for databases with explicit 
markings of damaged data, namely that in any case a single value is provided and 
stored in the database. We have considered two possible mappings of color 
markers into null values markers, as shown in Table 9. Nevertheless, these 
mappings cannot represent that a value is always provided by the defender or 
by the offender. Moreover, standard rules for the management of null values 
are too restrictive (see [5]) and therefore the null values theory cannot give a 
satisfactory solution for the management of databases with explicit markings 
of damaged data. 



Null Marker      Semantics         Null Marker      Semantics 

pLmark           ⊥                 ex-mark          D 
ma-mark          D ∪ ⊥             pm-mark(Vx)      Vx ∪ ⊥ 
pa-mark(Vx)      Vx                va-mark(v)       {v} 

Table 8 Null Values Markers 
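
The semantics of the null markers can be read as sets of candidate values, which also makes visible why they do not fit marked data: only one marker pins down a single stored value. The sketch below is illustrative only. The marker names follow Table 8, the function name and its arguments are assumptions, and Python's `None` stands in for the non-existent value.

```python
BOTTOM = None  # stands in for the non-existent value


def candidate_values(marker, domain, subset=None, value=None):
    """Return the set of values a null marker allows (hypothetical helper)."""
    table = {
        "pLmark": {BOTTOM},                        # the value does not exist
        "ex-mark": set(domain),                    # exists but is unknown
        "ma-mark": set(domain) | {BOTTOM},         # in the domain, or non-existent
        "pm-mark": set(subset or ()) | {BOTTOM},   # in Vx, or non-existent
        "pa-mark": set(subset or ()),              # in Vx
        "va-mark": {value},                        # the specified value
    }
    return table[marker]


domain = {1, 2, 3}
maybe = candidate_values("ma-mark", domain)
partial = candidate_values("pm-mark", domain, subset={1, 2})
exact = candidate_values("va-mark", domain, value=3)
```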





I mapping                     II mapping 

□R  ↦  pLmark                 □R  ↦  pLmark 
◇R  ↦  ex-mark                ◇R  ↦  ma-mark 
◇G  ↦  pa-mark(Vx)            ◇G  ↦  pm-mark(Vx) 
□G  ↦  va-mark(v)             □G  ↦  va-mark(v) 



Table 9 Mappings of marked data using null values 



7 CONCLUSIONS 

Maintaining absolute integrity defies the best efforts of system designers, and 
so approaches based on a black-and-white view of integrity necessarily invite 
problems. Fortunately, researchers are developing techniques to recognize and 
even exploit deviations in integrity. 
Our proposal aims to deal with the general problem of relaxing integrity 
constraints in databases following a tractable approach. We draw on the large 
body of theory developed for nonstandard logics to carve out a small, reasonable 
framework for solving our problem in an effective way. Nevertheless, our proposal 
is not the final solution to the problem; it lacks the complete automation of 
damaged data management. Moreover, our framework is rigidly defined for 
a marker scheme of four colors and cannot be extended to arbitrary sets of 
markers. To obtain a general solution, further investigations in the addressed 
areas of multi-valued and signed logics should be done. 

The results provided in this paper can be considered preliminary to a complete 
foundation of marked databases. With our proposal we have aimed to show a 
pragmatic and reasonable approach for dealing with a problem that is, in 
general, intractable, without sacrificing theoretical foundation. 

In future work we aim to study other relevant problems for databases with 
explicit markings of damaged data, such as the management of functional 
dependencies, integrity constraints, and query optimization. 
APPENDIX 1 MODAL LOGIC 

The language $\mathcal{L}_{ML}$ is defined by extending propositional logic with 
two additional symbols: □, ◇. Given a set $\mathcal{P}$ of propositional 
variables, $\mathcal{L}_{ML}$ formulas are defined as: 

Atoms: every $A \in \mathcal{P}$ and the two constants $\top$, $\bot$ are atomic 
propositional formulas of $\mathcal{L}_{ML}$; 

Formulas: if $A$, $B$ are formulas then $\neg A$, $A \wedge B$, $A \vee B$, 
$A \to B$, $(A)$, □$A$, ◇$A$ are formulas of $\mathcal{L}_{ML}$; 

nothing else is a formula of $\mathcal{L}_{ML}$. 

Definition 25 (Quasi-atomic formulas) If $A \in \mathcal{P} \cup \{\top, \bot\}$ 
is an atomic formula then □$A$, ◇$A$, □$\neg A$, ◇$\neg A$ are quasi-atomic 
formulas of $\mathcal{L}_{ML}$. 

The two symbols □, ◇ can be read in different ways. Common readings are: 
necessity and possibility: □$A$ = “it is necessary that A is true”, ◇$A$ = “it is 
possible that A is true”; always and sometimes: □$A$ = “A is always true”, 
◇$A$ = “A is sometimes true”; knowledge and ignorance: □$A$ = “it is known that 
A is true”, ◇$A$ = “it is ignored that A is true”. 

The standard semantics of $\mathcal{L}_{ML}$ formulas is the possible world 
semantics introduced by Kripke (see [15, 22] for many historical notes). A model 
for a set $\Gamma$ of $\mathcal{L}_{ML}$ formulas is a pair 
$M = (\mathcal{F}, V)$, where $\mathcal{F}$ is a frame and $V$ is an assignment 
for the propositional variables $\mathcal{P}$ of $\Gamma$. More precisely, a frame 
is a pair $\mathcal{F} = (W, R)$, where $W$ is a non-empty set of objects, the 
possible worlds, and $R \subseteq W \times W$ is a binary relation, the 
accessibility relation. An assignment $V : \mathcal{P} \to 2^W$ is a mapping from 
propositional variables in $\Gamma$ to subsets of worlds in $W$. 

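Definition 26 (A formula holds in a world w) Given an $\mathcal{L}_{ML}$ formula 
$A$ and a model $M = (W, R, V)$, $A$ holds in a world $w \in W$, denoted as 
$M \models_w A$, iff: 

$M \models_w A$ iff $w \in V(A)$, for a propositional variable $A$ 
$M \models_w \top$ 
$M \not\models_w \bot$ 
$M \models_w \neg A$ iff $M \not\models_w A$ 
$M \models_w (A)$ iff $M \models_w A$ 
$M \models_w A \wedge B$ iff $M \models_w A$ and $M \models_w B$ 
$M \models_w A \vee B$ iff $M \models_w A$ or $M \models_w B$ 
$M \models_w A \to B$ iff $M \models_w A$ implies $M \models_w B$ 
$M \models_w$ □$A$ iff $\forall t$, $wRt$ implies $M \models_t A$ 
$M \models_w$ ◇$A$ iff $\exists t \in W$, $wRt$ and $M \models_t A$ 

Definition 27 ($\mathcal{L}_{ML}$ satisfiability) Given an $\mathcal{L}_{ML}$ 
formula $A$ and a model $M = (W, R, V)$, $A$ is satisfiable in $M$, denoted as 
$M \models A$, iff $\forall w \in W$, $M \models_w A$. 

Definition 28 ($\mathcal{L}_{ML}$ validity) Given an $\mathcal{L}_{ML}$ formula 
$A$ and a frame $\mathcal{F}$, $A$ is valid in $\mathcal{F}$, denoted as 
$\mathcal{F} \models A$, iff $\forall M = (\mathcal{F}, V)$, $M \models A$. 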
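
The possible world semantics translates directly into a small recursive evaluator. The following is a minimal sketch, assuming a finite model and a hypothetical encoding of formulas as nested tuples such as `("box", ("var", "p"))`; none of the names below come from the paper.

```python
def holds(formula, world, worlds, access, valuation):
    """Decide whether `formula` holds in `world` of a finite Kripke model.

    worlds: set of worlds; access: set of (w, t) pairs; valuation: dict that
    maps a variable name to the set of worlds in which it is true.
    """
    op = formula[0]
    if op == "var":
        return world in valuation.get(formula[1], set())
    if op == "top":
        return True
    if op == "bot":
        return False
    if op == "not":
        return not holds(formula[1], world, worlds, access, valuation)
    if op in ("and", "or", "imp"):
        left = holds(formula[1], world, worlds, access, valuation)
        right = holds(formula[2], world, worlds, access, valuation)
        if op == "and":
            return left and right
        if op == "or":
            return left or right
        return (not left) or right
    # Worlds reachable from `world` through the accessibility relation.
    successors = [t for t in worlds if (world, t) in access]
    if op == "box":
        return all(holds(formula[1], t, worlds, access, valuation) for t in successors)
    if op == "dia":
        return any(holds(formula[1], t, worlds, access, valuation) for t in successors)
    raise ValueError(f"unknown connective: {op}")


def satisfiable_in_model(formula, worlds, access, valuation):
    """Satisfiability in a model as defined above: the formula holds in every world."""
    return all(holds(formula, w, worlds, access, valuation) for w in worlds)


W = {1, 2}
R = {(1, 1), (1, 2), (2, 1), (2, 2)}
V = {"p": {1, 2}, "q": {1}}
box_p = ("box", ("var", "p"))
dia_q = ("dia", ("var", "q"))
```

With this model, `box_p` holds everywhere because p is true in both worlds, while `dia_q` holds in world 2 only because world 1 is accessible from it.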

Given a subset Tax of $\mathcal{L}_{ML}$ formulas, Tax defines a modal logic if: 
(i) Tax includes all tautologies; (ii) Tax is closed under modus ponens and 
uniform substitution. 

Definition 29 (Theorems) Given a logic Tax and an $\mathcal{L}_{ML}$ formula $A$, 
$A$ is a theorem of Tax, denoted as $\vdash_{Tax} A$, iff $A \in$ Tax. 

Definition 30 (Deducibility) Given a logic Tax and a set $\Gamma \cup \{A\}$ of 
$\mathcal{L}_{ML}$ formulas, $A$ is deducible from $\Gamma$ in Tax, denoted as 
$\Gamma \vdash_{Tax} A$, iff there exist $B_1, B_2, \ldots, B_n \in \Gamma$ such 
that: $\vdash_{Tax} B_1 \to (B_2 \to (\cdots \to (B_n \to A) \cdots))$ 

Definition 31 (Consistency) Given a logic Tax and an $\mathcal{L}_{ML}$ formula 
$A$ (a set of formulas $\Gamma$), $A$ ($\Gamma$) is consistent with respect to 
Tax iff it is not true that $\Gamma \vdash_{Tax} \bot$. 

Much of the work in modal logic has concerned proof theory, soundness, 
completeness and decidability (see again [22, 11]). Among others, two fundamental 
problems are distinguished for a given modal logic Tax and a class of frames $C$: 

soundness: $A \in$ Tax implies $C \models A$ 

completeness: $C \models A$ implies $A \in$ Tax 

When, for a set of frames $C$ and for a logic Tax, both soundness and 
completeness hold, it is said that Tax is determined by the set of frames $C$. 

A modal logic Tax can be syntactically characterized using axioms or schemata. A 
schema is a collection of formulas included in Tax all having a common syntactic 
form. Describing a modal logic using specific axioms imposes specific constraints 
on the accessibility relation $R$. The most studied modal logics are normal modal 
logics, which are those that include the schema K: □$(A \to B) \to$ (□$A \to$ □$B$) 
and are closed under the rule of necessitation: if $A \in$ Tax then □$A \in$ Tax. 
Each normal logic can be denoted using the notation $K AX_1 \ldots AX_n$, where 
each $AX_i$ denotes a specific axiom. In Table 10 we report some well-known 
axioms. 



Name   Axiom           Constraint 

D      □A → ◇A         Serial: ∀w ∃t wRt 
T      □A → A          Reflexive: ∀w wRw 
4      □A → □□A        Transitive: ∀w ∀t ∀z wRt ∧ tRz → wRz 
B      A → □◇A         Symmetric: ∀w ∀t wRt → tRw 
5      ◇A → □◇A        Euclidean: ∀w ∀t ∀z wRt ∧ wRz → tRz 

Table 10 The schemata for Modal Logics 
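
For a finite frame, the constraints on the accessibility relation listed in Table 10 can be checked mechanically. The sketch below assumes the frame is given as a set of worlds and a set of ordered pairs; the function names are illustrative.

```python
def is_serial(worlds, relation):
    """Every world has at least one accessible world."""
    return all(any((w, t) in relation for t in worlds) for w in worlds)


def is_reflexive(worlds, relation):
    """Every world accesses itself."""
    return all((w, w) in relation for w in worlds)


def is_transitive(worlds, relation):
    """wRt and tRz imply wRz."""
    return all((w, z) in relation
               for (w, t) in relation for (u, z) in relation if t == u)


def is_symmetric(worlds, relation):
    """wRt implies tRw."""
    return all((t, w) in relation for (w, t) in relation)


def is_euclidean(worlds, relation):
    """wRt and wRz imply tRz."""
    return all((t, z) in relation
               for (w, t) in relation for (u, z) in relation if w == u)


CHECKS = (is_serial, is_reflexive, is_transitive, is_symmetric, is_euclidean)

worlds = {1, 2, 3}
equivalence = {(1, 1), (2, 2), (3, 3), (1, 2), (2, 1)}
single_edge = {(1, 2)}
equivalence_profile = [check(worlds, equivalence) for check in CHECKS]
single_edge_profile = [check(worlds, single_edge) for check in CHECKS]
```

An equivalence relation satisfies all five constraints, whereas a frame with a single edge is only (vacuously) transitive.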



The validity problem is decidable for most modal logics; that is, for a given 
modal logic Tax there is an algorithm for determining, for each modal formula 
$A$, whether or not $A \in$ Tax. In general, the computational complexity 
analysis of the validity problem for modal logics gives rise to intractable 
results; that is, the problem belongs to the NP or EXPTIME classes [22]. 

For this work the main result follows from the coding of color markers into 
propositional modal formulas of a specific normal modal logic, i.e. KT45 = S5. 
S5 is a well-known modal logic, called the logic of necessity because it 
characterizes the notion of logical necessity. A logically necessary truth is one 
which is true in all possible worlds, as for Green and Red data. Two important 
results hold for S5 (see [22]): 

1. S5 is determined by the class of reflexive, symmetric and transitive frames. 

2. The satisfiability problem for S5 is decidable and NP-complete (so the 
validity problem is decidable and co-NP-complete). 
APPENDIX 2 MULTI-VALUED LOGIC 

The language $\mathcal{L}_{\mathcal{FN}}$ is defined as for classical 
propositional logic. Given a set $\mathcal{P}$ of propositional variables, 
$\mathcal{L}_{\mathcal{FN}}$ formulas are defined as follows: 

Atoms: every $A \in \mathcal{P}$ and the two constants $\top$, $\bot$ are atomic 
propositional formulas of $\mathcal{L}_{\mathcal{FN}}$; 

Formulas: if $A$, $B$ are formulas then $\neg A$, $A \wedge B$, $A \vee B$, 
$A \to B$, $(A)$ are formulas of $\mathcal{L}_{\mathcal{FN}}$; 

nothing else is a formula of $\mathcal{L}_{\mathcal{FN}}$. 

The semantics of a propositional multi-valued logic depends on the set of values 
chosen for defining the valuation function. Formally, a model for a set $\Gamma$ 
of $\mathcal{L}_{\mathcal{FN}}$ formulas is a pair $M = (\mathcal{F}, V)$, where 
$\mathcal{F} = (\mathcal{N}, \Lambda)$ is a skeleton for the logical connectives 
of $\mathcal{L}_{\mathcal{FN}}$ and $V : \mathcal{P} \to \mathcal{N}$ is a 
valuation function that maps propositional variables onto $\mathcal{N}$. More 
precisely, $\mathcal{N}$ represents the domain of truth values and 
$\Lambda = \{\Lambda(\wedge), \Lambda(\vee), \Lambda(\to), \Lambda(\neg)\}$ is a 
set of mappings on $\mathcal{N}$ of suitable arity associated with each logical 
connective*. Given a model $M = (\mathcal{N}, \Lambda, V)$, an interpretation 
$(\cdot)^M$ is defined as the homomorphic extension of $V$ that maps 
$\mathcal{L}_{\mathcal{FN}}$ formulas onto $\mathcal{N}$. That is, for any logical 
connective $op(x_1, x_2, \ldots, x_k)$ of arity $k$ it holds: 
$(op(x_1, x_2, \ldots, x_k))^M = \Lambda(op)((x_1)^M, (x_2)^M, \ldots, (x_k)^M)$ 
and for any propositional variable $A \in \mathcal{P}$ it holds: $(A)^M = V(A)$. 

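Definition 32 ($\mathcal{L}_{\mathcal{FN}}$ satisfiability) Given 
$F \in \mathcal{L}_{\mathcal{FN}}$, $U \subseteq \mathcal{N}$ and 
$M = (\mathcal{F}, V)$, $(F, U)$ is satisfiable in $M$, denoted as 
$M \models (F, U)$, iff $(F)^M \in U$. 

Definition 33 ($\mathcal{L}_{\mathcal{FN}}$ validity) Given 
$F \in \mathcal{L}_{\mathcal{FN}}$, $U \subseteq \mathcal{N}$ and $\mathcal{F}$, 
$(F, U)$ is valid in $\mathcal{F}$, denoted as $\mathcal{F} \models (F, U)$, iff 
$\forall M = (\mathcal{F}, V)$, $M \models (F, U)$. 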

Any valid formula is a tautology. Moreover, because for a given logic the 
skeleton $\mathcal{F}$ is also given, satisfiability and validity are usually 
related only to the valuation function $V$ and not to the whole model 
$M = (\mathcal{F}, V)$. In many cases, among the elements of $\mathcal{N}$ the 
value true is distinguished, and therefore satisfiability and validity are 
considered with $U$ = {true}. Depending on $\mathcal{N}$ we have different 
multi-valued logics; for example, if $\mathcal{N}$ = {true, false} we have 
classical propositional logic. That is, propositional multi-valued logics 
generalize classical propositional logic. If $\mathcal{N}$ is finite and its 
cardinality is small (three or four), each $\Lambda(op) \in \Lambda$ is defined by 
a truth table, as we have done. Other possible multi-valued logics come from 
choosing $\mathcal{N}$ infinite or finite and unordered or ordered. In any case, 
when $\mathcal{N}$ is a lattice $(\mathcal{N}, \cap, \cup)$, each logical 
connective is defined in terms of the join ($\cup$) and meet ($\cap$) operations 
over the lattice elements of $\mathcal{N}$ (see [13] for further details). 
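
The homomorphic extension of a valuation, together with satisfiability relative to a set U of designated values, can be sketched in a few lines. The example below is an assumption-laden illustration: it uses Kleene's strong three-valued connectives on the ordered values 0 < 1/2 < 1 (meet = min, join = max), which are not necessarily the truth tables used in this paper, and it encodes formulas as nested tuples.

```python
from fractions import Fraction

HALF = Fraction(1, 2)  # the intermediate truth value

# Skeleton: one mapping per connective (Kleene's strong tables, an assumption).
SKELETON = {
    "and": min,                         # meet of the ordered values
    "or": max,                          # join of the ordered values
    "not": lambda a: 1 - a,
    "imp": lambda a, b: max(1 - a, b),
}


def interpret(formula, valuation, skeleton=SKELETON):
    """Extend `valuation` from variables (strings) to whole formulas."""
    if isinstance(formula, str):
        return valuation[formula]
    op, *args = formula
    return skeleton[op](*(interpret(a, valuation, skeleton) for a in args))


def satisfiable(formula, designated, valuation, skeleton=SKELETON):
    """(F, U) is satisfiable when the value of F falls inside U."""
    return interpret(formula, valuation, skeleton) in designated


valuation = {"p": 1, "q": HALF}
meet_value = interpret(("and", "p", "q"), valuation)
join_value = interpret(("or", "p", "q"), valuation)
self_implication = interpret(("imp", "q", "q"), valuation)
```

Note that `("imp", "q", "q")` evaluates to 1/2 here, so whether it counts as satisfied depends entirely on the choice of U, which is the point of pairing a formula with a set of designated values.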



*More generally, a multi-valued logic can extend the standard set of logical 
connectives with new ones, providing for them specific maps $\Lambda(op)$ on 
$\mathcal{N}$. 
APPENDIX 3 TABLES 



SSN              TELEPHONE 
245569789:0 R    703-708-4429:D R 
245569787:D R    703-993-1628:0 R 

SSN              TELEPHONE 
234599133:0 G    703-993-1629:0 G 
324599587:0 G    703-708-4429:0 G 

SSN              TELEPHONE 
325599585:0 G    703-993-1628:0 R 
324499581:0 R    703-708-4427:0 G 
245569789:0 G    703-993-1629:0 R 
245569789:0 R    703-708-4429:0 R 
245569787:0 R    703-993-1628:0 R 

SSN              TELEPHONE 
234599133:0 G    703-993-1629:0 G 
324599587:0 G    703-708-4429:0 G 
324499581:0 R    703-708-4427:0 G 
245569789:0 G    703-993-1629:0 R 

Table 11 The projection of a marked relation 
324499581:0 G 


1508:0 G 


1808:0 G 


324499581:0 G 


150$:Q G 


180$:O G 


325599585:0 G 


1508:0 G 


2008 :o G 


325599585:0 G 


150$:O G 


200$:o G 


325599585:0 R 


1508:0 R 


2008:O R 








245569789:0 R 


1508:0 G 


1008:o R 
325599585:0 R 


2508:o G 


3008:O G 


SSN 


BILL 


LAST-BILL 


325599585:0 R 


2508:0 R 


1508:O G 


325599585:0 R 


150$:O R 


200$:O R 


325599585:0 R 


1508:0 R 


2008:o R 


245569789:0 R 


300$:0 R 


4008:0 R 


245569789:0 R 


1508:0 G 


1008:o R 














245569789:0 R 


3008:0 R 


4008:0 R 



Table 12 The result of where T =BILL > LAST-BILL 
VIEW 

SSN              BILL        LAST-BILL 
234599133:D G    200$:0 G    250$:O R 
324499581:0 G    150$:O G    180$:O G 
324499581:D R    200$:O G    400$:O R 
324499581:D R    400$:O G    100$:0 R 
324499581:0 G    300$:O R    400$:0 R 
325599585:0 R    250$:o G    300$:o G 
325599585:0 G    150$:O G    200$:O G 
325599585:0 R    250$:O R    150$:O G 
325599585:0 R    150$:O R    200$:O R 
245569789:0 R    150$:O G    100$:O R 
245569789:0 R    300$:0 R    400$:O R 

Tuple constraints: 

□G 234599133 ∧ □G 200$ ∧ ◇R 250$ 
◇G 324499581 ∧ □G 150$ ∧ □G 180$ 
□R 324499581 ∧ □G 200$ ∧ □R 400$ 
□R 324499581 ∧ ◇G 400$ ∧ □R 100$ 
□G 324499581 ∧ ◇R 300$ ∧ □R 400$ 
□R 325599585 ∧ ◇G 250$ ∧ ◇G 300$ 
□G 325599585 ∧ □G 150$ ∧ ◇G 200$ 
□R 325599585 ∧ ◇R 250$ ∧ ◇G 150$ 
◇R 325599585 ∧ « R 150$ ∧ ◇R 200$ 
O R 245569789 ∧ ◇G 150$ ∧ ◇R 100$ 
◇R 245569789 ∧ □R 300$ ∧ □R 400$ 

Table 13 Relation view and the corresponding tuple constraints 







O G234599133 A O G200S A -<250$ A 250$ 


X A X A -250$ 


324499581 A O G150S A O G180S 


-324499581 A 324499581 A X A X 


X A O 0200$ A X 


□ R324499581 A X A □ R400$ 


X A 400$ A X 


□ R324499581 A -400$ A 400$ A □ RIOOS 


O G324499581 A -300$ A 300$ A X 


X A -300$ A □ R400S 


X A 250$ A 300$ 


□ R325599585 A -250$ A 250$ A -300$ A 300$ 


O G325S99585 A O G150S A 200$ 


X A X A -200$ A 200$ 


X A -250$ A 250$ A 150$ 


□ R326599585 A -250$ A -150$ A 150$ 


-325599585 A 325599585 A -150$ A 150$ A -200$ A 200$ 


-325599585 A -150$ A -200$ 


-245569789 A 245569789 A 150$ A -100$ A 100$ 


-245569789 A -150$ A 150$ A -100$ 


-245569789 A 245569789 A X A X 


-245569789 A □ R300S A □ R400S 



Table 14 The query constraints for relation view 



SSN              TELEPHONE           COMPANY 
234599133:0 R    703-993-1629:0 G    Sprint:0 G 
234599133:0 R    703-993-1629:0 G    Sprint:0 G 


Table 15 The result of pre- under if interpretation 



SSN              TELEPHONE           COMPANY 
234599133:0 G    703-708-4428:0 R    AT&T:o R 
234699133:a R    703-993-1629:0 G    AT&T:0 G 
234599133:0 R    703-993-1629:0 G    Sprint:0 G 
245569789:0 G    703-993-1628:0 G    AT&T:0 G 
234599133:0 G    703-708-4427:0 G    AT&T:o R 
234599133:0 R    703-708-4429:0 R    AT&T:0 G 
234599133:0 R    703-993-1629:0 G    Sprint:0 G 
245569789:0 G    703-708-4429:0 R    AT&T:0 G 


Table 16 The result of pre- under f interpretation 



SSN              TELEPHONE           COMPANY 
∅                ∅                   ∅ 


Table 17 The result of under J interpretation 



SSN              TELEPHONE           COMPANY 
234599133:0 G    703-708-4428:0 R    AT&T:0 R 
245569789:0 G    703-993-1628:0 G    AT&T:0 G 
234599133:0 G    703-708-4427:0 G    AT&T:0 R 
245569789:0 G    703-708-4429:0 R    AT&T:0 G 



Table 18 The result of under f interpretation 



SSN              TELEPHONE           PAYMENT             COMPANY 
234599133:0 R    703-708-4429:0 G    703-993-1629:0 G    Sprint:0 G 
234599133:0 R    703-993-1629:0 G    703-993-1629:0 G    Sprint:O G 
245569789:0 G    703-993-1628:0 G    703-993-1629:0 G    Sprint:0 G 



Table 19 The result of pre- under J interpretation 




SSN              TELEPHONE           PAYMENT             COMPANY 
234599133:0 R    703-708-4429:0 G    703-993-1629:0 G    Sprint:0 G 
234599133:0 G    703-708-4428:0 R    703-708-4427:0 G    AT&T:o R 
234599133:0 G    703-708-4428:0 R    703-708-4429:0 R    AT&T:0 G 
234599133:0 R    703-993-1629:0 G    703-708-4429:0 R    AT&T:0 G 
234599133:0 R    703-993-1629:0 G    703-993-1629:0 G    Sprint:0 G 
245569789:0 G    703-993-1628:0 G    703-708-4429:0 R    AT&T:0 G 
245569789:0 G    703-993-1628:0 G    703-993-1629:0 G    Sprint:0 G 



Table 20 The result of pre- under f interpretation 



SSN              TELEPHONE           PAYMENT             COMPANY 
245569789:0 G    703-993-1628:0 G    703-993-1629:0 G    Sprint:0 G 



Table 21 The result of under J interpretation 



SSN              TELEPHONE           PAYMENT             COMPANY 
234599133:0 R    703-708-4429:0 G    703-993-1629:0 G    Sprint:0 G 
234599133:0 G    703-708-4428:0 R    703-708-4427:0 G    AT&T:0 R 
234599133:0 G    703-708-4428:0 R    703-708-4429:0 R    AT&T:0 G 
245569789:0 G    703-993-1628:0 G    703-708-4429:0 R    AT&T:0 G 
245569789:0 G    703-993-1628:0 G    703-993-1629:0 G    Sprint:0 G 



Table 22 The result of under f interpretation 
Abstract 

One aspect of maintaining integrity in information systems is establishing an 
organizational environment that will prevent the damage caused by external agents. 
One particularly insidious such agent is the computer virus. It can alter data often 
without the owner or user of that data being aware. Establishing such an 
environment can often rely on the availability of metrics for organizational 
characteristics associated with harm to data integrity. This paper will focus on the 
development of organizational metrics for the threat of computer viruses. It is 
expected that many of these metrics will apply to other threats to data integrity 
although we have not pursued that line of research. 

In the case of computer viruses and some other types of malicious code, the formal 
analogies to the infection dynamics of biological viruses permit the utilization of 
epidemiological concepts in the development of metrics. This paper demonstrates 
how a simple epidemiological model of computer viruses provides insights about the 
importance of several metrics: 

• frequency of contact; 

• utilization of antiviral software; 
• effectiveness of antiviral software; 

• likelihood of notifying other people about computer viruses detected; 

• frequency of updating antiviral software. 

The dynamics of computer virus transmission will be combined with broader 
environmental factors of an organization in a computer simulation model that 
elaborates on the simple one shown here. In addition, epidemiological concepts 
from environmental health permit the integration of environmental perspectives and 
disease control. This paper shows how these concepts are guiding the development 
of survey methods to obtain information for the construction and validation of the 
computer simulation model. The benefits of a broader analysis of an organization’s 
environment are an understanding of how organizational complexities influence the 
risk associated with computer viruses, a representation of organizational factors in 
terms familiar to those making decisions and integration of assessment of risk from 
multiple threats. 

Strengths and weaknesses of various methodologies are discussed. Looking 
beyond the current project, future research is suggested in the areas of integrated 
assessment modeling, data collection strategies, and the threats presented by new 
kinds of malicious code. 



Keywords 

Mathematical model, risk assessment, computer virus, malicious code, computer 
simulation, information security metric 



1 INTRODUCTION 

One aspect of maintaining integrity in information systems is establishing an 
organizational environment that will prevent the damage caused by external agents. 
One particularly insidious such agent is the computer virus. It can alter data often 
without the owner or user of that data being aware. Establishing such an 
environment can often rely on the availability of metrics for organizational 
characteristics associated with harm to data integrity. This paper will focus on the 
development of organizational metrics for the threat of computer viruses. It is 
expected that many of these metrics will apply to other threats to data integrity 
although we have not pursued that line of research. 

Paradoxically, computer viruses may be scarcely mentioned in an overview of 
issues in information security and integrity management (GAO, 1998), while a 
large percentage of organizations report computer virus incidents despite 
widespread use of antiviral software (Power, 1998). The proper assessment of 
computer viruses in information security and integrity management depends on 
estimates of the risk and impact of computer virus incidents and an analysis of how 
they are influenced by various factors in the computing environment. Since 
relatively little good data exist, the methods for risk assessment include 
mathematical or computer simulation models to synthesize available information 
and provide a theoretical basis for the development of metrics. Development of 
metrics in turn can guide additional data collection efforts. 

The first step is to use formal analogies with the transmission of biological 
viruses. The use of these analogies is justified as mathematical formulations of 
propagation processes that are not restricted to biological agents and indeed can be 
applied to various physical phenomena. In the early 1990s, there was a flurry of 
interest in models of the dynamics of the transmission and control of computer 
viruses (NCC, 1991). The best example of this work is from the IBM research labs 
(Kephart and White, 1991; Kephart and White, 1993). This work offers a 
compelling explanation of the observation that computer viruses never became a 
problem for all (or nearly all) computers in all organizations. Much of the 
subsequent effort in this area has been directed at the development of better software 
tools to control viruses (Kephart et al, 1997). However, in the midst of a growing 
recognition of the importance of databases for information security and integrity 
management (GAO, 1998), there is a need to embed the concepts from computer 
viruses in a broader framework of organizational metrics. 

Section 2 demonstrates how a simple epidemiological model of computer viruses 
provides insights about the importance of measures (i.e., ‘metrics’) for frequency 
of contact, utilization of antiviral software, effectiveness of antiviral software, 
likelihood of notifying other people about computer viruses detected, and frequency 
of updating antiviral software. The dynamics of computer virus transmission will be 
combined with broader environmental factors of an organization in a computer 
simulation model that elaborates on the simple one shown here. Epidemiological 
concepts from environmental health permit the integration of environmental 
perspectives and disease control. Section 3 describes the organizational factors 
considered in the development of survey methods to obtain information for the 
construction and validation of the computer simulation model. Finally, Section 4 
discusses the strengths and weaknesses of the modeling approach and the survey 
design. Topics for future research are suggested in the areas of integrated 
assessment modeling, data collection strategies, and the threats presented by new 
kinds of malicious code. 



2 SIMPLE EPIDEMIOLOGICAL MODEL OF COMPUTER VIRUSES 

The aim of the research discussed in this paper is to develop a mathematical model 
that will characterize individual outbreaks of computer viruses in an organization of 
computer users whose computers define a population of interest. The simple model 
presented here illustrates basic principles for the control of computer viruses and 
forms the basis for a more detailed computer simulation model that will be used to 
construct organizational metrics. 
An individual outbreak is caused by the replication of a single computer virus. A 
computer is infected and infective when the single computer virus that could be 
transmitted to another computer is in the computer’s memory or an application 
program on its hard drive. Immunity in this context means that antiviral software 
installed on a computer can be used effectively for detection and prevention of 
infection. Immunity has to be considered separately for specific viruses because 
antiviral software may be effective against some viruses and not others. The term 
partial immunity is used because computer viruses that can be detected by installed 
antiviral software are sometimes missed because the software is not always used 
effectively. If the antiviral software cannot detect the computer virus at all, then the 
virus is considered to be undetectable. 

The basic reproduction ratio is a useful metric from epidemiology that represents 
the potential spread of infection from a single infected individual. The basic 
reproduction ratio characterizes threshold behavior of the dynamics of transmission. 
If the ratio is less than unity, then the infection does not spread. If the ratio is 
greater than unity, the infection does spread. Larger values of the ratio indicate a 
greater amount of spread. The basic reproduction ratio is the product of contacts per 
day, the duration of infection, and the fraction of contacts that are susceptible. The 
structure of the model and the results are described in more detail in the Appendix. 
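
The product form of the basic reproduction ratio and its threshold behavior can be written down directly. The sketch below is illustrative only: the function names and the parameter values are assumptions chosen to show one case above and one case below unity.

```python
def basic_reproduction_ratio(contacts_per_day, duration_days, susceptible_fraction):
    """Product of contacts per day, duration of infection, and susceptible fraction."""
    return contacts_per_day * duration_days * susceptible_fraction


def spreads(ratio):
    """Threshold behavior: the infection spreads only if the ratio exceeds unity."""
    return ratio > 1.0


# Hypothetical organization A: frequent contacts, slow detection.
ratio_a = basic_reproduction_ratio(contacts_per_day=0.5, duration_days=10,
                                   susceptible_fraction=0.3)
# Hypothetical organization B: fewer contacts, faster detection.
ratio_b = basic_reproduction_ratio(contacts_per_day=0.2, duration_days=5,
                                   susceptible_fraction=0.5)
```

Because the ratio is a plain product, halving the period in which an infection remains undetected has the same effect as halving the contact rate.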

The parameter values are intended to be illustrative. That the results are sensitive 
to these values demonstrates that these parameters are good candidates for 
organizational metrics. The analysis of the basic reproduction ratio in Table 4 
makes clear that transmission will be increased not only in organizations with 
higher rates of contacts that permit transmission of computer viruses, but also with 
organizations in which infections remain undetected for longer periods. The 
beneficial effect of the notification policy in Table 4 is a result of people acting 
promptly to notify system administrators of computer virus problems and reduce the 
period in which infections remain undetected. The basic reproduction ratio can be 
readily calculated for any choice of parameters. 

However, the calculation of outbreak sizes from the basic reproduction ratios is 
not so simple. Although the general relationships among the outbreak sizes in Table 
2 are explained by the basic reproduction ratios in Table 4, it should also be noted 
that the outbreak sizes are somewhat smaller overall than might be expected from 
the corresponding basic reproduction ratios. The reason for the difference is the 
assumption that the first detection of an infected computer stops the outbreak. As 
more computers are infected in the outbreak, there is a greater chance that at least 
one of the infected computers will be detected. Consequently, the outbreak is usually 
terminated before the initial infection completes the duration of infection expected if 
it were the only infection. In general, computer simulations are needed to 
characterize the effects of complex organizational behavior like patterns of 
reporting. 
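
The effect of the stopping rule can be explored with a small stochastic simulation. This is a simplified sketch under stated assumptions, not the model of the Appendix: time advances in days, each infected computer infects one further computer per day with probability `p_transmit`, each infected computer is detected independently with probability `p_detect` per day, and the first detection ends the outbreak.

```python
import random


def prob_any_detected(n_infected, p_detect):
    """Chance that at least one of n infected computers is detected in a day."""
    return 1.0 - (1.0 - p_detect) ** n_infected


def simulate_outbreak(population, p_transmit, p_detect, max_days, rng):
    """Return the number of infected computers when the outbreak stops."""
    infected = 1
    for _ in range(max_days):
        # The first detection of any infected computer stops the outbreak.
        if rng.random() < prob_any_detected(infected, p_detect):
            break
        new_cases = sum(1 for _ in range(infected) if rng.random() < p_transmit)
        infected = min(population, infected + new_cases)
    return infected


rng = random.Random(0)  # seeded for repeatability
sizes = [simulate_outbreak(50, 0.3, 0.1, 30, rng) for _ in range(200)]
mean_size = sum(sizes) / len(sizes)
```

Since `prob_any_detected` grows with the number of infected computers, larger outbreaks are cut short sooner, which is why simulated outbreak sizes fall below what the basic reproduction ratio alone would suggest.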

Another issue is the effect of updating software. A major concern is the generation 
of new computer viruses. The frequency of updating antiviral software is another 
important organizational metric. Clearly, large numbers of computers can be
infected if antiviral software cannot detect a computer virus (see the undetectable 
scenario in Table 2). In this scenario, all of the computers are considered to be 
susceptible with no immunity at all. Fortunately, a large majority of incidents are 
caused by a small number of computer viruses (NCSA, 1996). Although new 
computer viruses are generated all the time, very few computer viruses reach large 
numbers of computers. Therefore, the percentage of exposures to computer viruses 
that result in the undetectable scenario is expected to be small. The percentage of 
such exposures can and should be reduced by regular updates of antiviral software, 
but a focus on increasing the frequency of updating to higher and higher levels will 
have a progressively smaller effect on the number of computer virus incidents. The 
problem of undetectable computer viruses is that of relatively rare but highly 
damaging events. 



3 RISK FACTORS IN ORGANIZATIONAL ENVIRONMENT 

The dynamics of computer virus transmission will be combined with broader 
environmental factors of an organization in a computer simulation model that 
elaborates on the simple one shown above. Epidemiological concepts from 
environmental health permit the integration of environmental perspectives and 
disease control (Beaglehole et al, 1993). These concepts are guiding the 
development of survey methods to obtain information for the construction and 
validation of the computer simulation model. Computer security metrics are 
analogous to environmental health indicators, which are defined as an expression of 
the link between environment and health, targeted at an issue of specific policy or 
management concern and presented in a form which facilitates interpretation for
effective decision-making (Briggs et al, 1996). An event, condition, characteristic or 
combination of such may be associated with a disease in that the presence of a factor 
indicates an increased risk of occurrence of a disease. Establishing associations is 
an important part of linking the environment with disease. In applying these 
concepts to computer security, adverse security outcomes replace adverse health 
outcomes. The benefits of a broader analysis of an organization’s environment are 
an understanding of how organizational complexities influence the risk associated 
with computer viruses, a representation of organizational factors in terms familiar 
to those making decisions and integration of assessment of risk from multiple 
threats. Therefore, the pertinent information about an organizational environment 
includes parameters specific to computer viruses and characteristics of the security 
environment in general. 

We are fielding a survey in 1998 via the World Wide Web with the focus at the 
level of a workgroup because the security practices of workgroups may be quite 
heterogeneous in a large organization. The survey has several sections, including: 

• basic demographic information about the organization and the workgroup of
the respondent; 

• views of the respondent about the threat of computer viruses and general 
aspects of protection; 

• system environment including groupware, network management and 
organizational turnover; 

• practices for sharing files, ranging from physical media to shared network 
volumes and internet services;

• practices for system protection, ranging from URL filtering, usage of antiviral 
software, reporting of computer viruses to system administrators to user 
training about information security. 

The environmental variables are being analyzed in relation to variables about 
computer virus experience over the past twelve months in the respondent’s 
workgroup. Experience takes into account not only the frequency of occurrence of
computer virus incidents, but also the impact of such incidents on the organization.
In addition to questions about recent experience, questions are asked about serious 
computer virus incidents at any time, where the respondent is told that an incident 
could be serious because of size, duration, information lost or activities disrupted. It 
is desirable to ask about serious computer virus incidents because many computer 
virus incidents have limited impact and therefore low priority as a security issue. 
However, it is necessary to ask about serious computer virus incidents with a longer 
time frame because such incidents are relatively rare. Although respondents are 
asked about factors that were responsible for the most recent serious computer virus 
incident and its impact, it is not possible to analyze these incidents in relation to all 
of the other environmental variables because of the longer time frame and possibly 
different location. 



4 DISCUSSION 

This paper builds on models of the epidemiology of computer viruses, especially the 
work from the IBM research lab from the early 1990s (Kephart and White, 1991; 
Kephart and White, 1993). Like the earlier models, there is a basic threshold of 
contact required for spread of infection that depends on how long infections remain 
undetected. Notification policies can be very effective at reducing transmission even 
though the policies in different models are implemented quite differently. 

The most important difference is the assumption about the pattern of contact. The 
IBM work assumes that transmission patterns are tightly clustered whereas the 
model in Section 2 and the Appendix assumes a simple pattern of random contact. 
The difference arises in part because of the different aims of the models. The aim of 
the IBM study was to make general statements about the phenomenon of clustering 
of contacts in small groups. The assumption of tight clustering leads to the 
conclusion that the prevalence of computer viruses will rise slowly even in the
absence of antiviral protection, although antiviral protection does help reduce the 
prevalence. Occasional large outbreaks are not a contradiction; the aim is an 
examination of all susceptible computers in the population. However, there was no 
attempt to collect detailed data on the patterns of contact that would be required to 
construct security metrics for particular workgroups. For this purpose, it is 
desirable to keep the model structure and associated data requirements as simple as 
possible. 

Another reason for the differences in the assumptions about contact patterns is the 
different time periods involved. Highly clustered contact patterns are probably less 
important in limiting transmission of computer viruses in the computing 
environment of the late 1990s because several factors in recent years have increased 
opportunities for transmission. There is more electronic mail allowing attachments, 
intranet servers allowing ready access to documents across an organization, as well 
as the explosive growth of the internet. Associated organizational changes foster 
more contact across and within organizations. Workers are shuffled in and out of 
flexible teams while more electronic communication is shared between clients and 
vendors. It remains to be seen if a model structure based on random contacts 
produces a reasonable estimate of computer virus risk. The answer will probably 
depend on the size of the workgroup, with reasonable estimates for smaller groups 
and gross overestimates for large organizations. It is also difficult to treat large 
organizations as a unit because of heterogeneity in security practices within 
organizations. The appropriate linkage of security issues at different levels of the 
organization requires more attention. 

The development of models for integrated assessment also demands a greater 
connection between the studies of the spread of computer viruses and the 
organizational environment. In order to achieve this integration, it is necessary to 
collect relevant data from organizations. Little or no useful data exist that correlate 
specific losses to the integrity or confidentiality of computer data to specific 
incidents. One of the more detailed computer virus surveys in the U.S. that is 
publicly available is an annual telephone survey of organizations conducted by the 
National Computer Security Association (now the International Computer Security 
Association). This survey, whose script is provided elsewhere (NCSA, 1996), 
contributes useful information about the types of viruses in circulation, routes of 
entry into the organization and types of antiviral software. However, the survey 
being fielded by us has more organizational information such as attitudes towards 
computer virus security, general information security training and turnover in the 
organization. To some extent, this reflects the difference between a focus on specific 
brands of technology products and a focus on the organizational environment; it is 
probably impossible to do justice to both areas in one survey. 

The General Accounting Office has found a growing interest in more precise 
measurement of costs and benefits of security-related policies (GAO, 1998). Better 
characterization of the severity of security and integrity incidents is clearly needed. 
A recent study of Internet security incidents reported to the Computer Emergency
Response Team at Carnegie Mellon University combined severity measures related 
to duration, the number of sites involved and the number of messages generated 
(Howard, 1997). Research on impacts will have to integrate computer virus issues 
with other security issues because of the crosscutting effects of security policies. For 
example, investment in better backup services has to be justified on the basis of all 
threats to data integrity. 

In addition to generating databases for internal management, there has to be an 
effort to overcome barriers to external reporting of computer security information. 
Examination of security problems across organizations provides important insights 
about threats and the effective organizational strategies. Tools that allow such 
information to be gathered while maintaining confidentiality are a challenge. 

Finally, any effort to develop metrics for computer virus risk must be aware of the 
evolving threat. The nature of the computer virus threat has already changed as the 
predominant operating systems have changed (Kephart et al, 1997). In the future, 
there may be new hybrids of computer viruses with malicious code that slips in 
through inadvertent access of URLs because of the trend of becoming unaware of or 
indifferent to the physical location of accessed files. This might be in the metadata 
of a Microsoft Word document or the HTML in electronic mail. Computer viruses 
could attack software agents that are authorized to send electronic mail with 
attachments, in some ways similar to the Internet worm that rapidly made copies of 
itself and sought new hosts (Kephart et al, 1997). At another level, evolutionary 
computational models allow programs themselves to evolve in ways analogous to 
natural selection, removing the human programmer from the loop (Burke, 1998). 
Clearly, integrated assessment of computer virus risk has to be embedded in a larger 
context of malicious code and information security. 



8 APPENDIX

The basic structure of the model borrows from the SIR model in epidemiology, 
which classifies people as susceptibles (S), infectives (I) and removals or recovereds 
(R). This model and its extensions are commonly applied to so-called childhood 
immunizable diseases like measles in which susceptible people acquire the infection 
from contact with infective individuals who then recover to become immune from 
reinfection. 

In adapting the model to analyze a single computer virus outbreak in an 
organization, the concepts of susceptibility, infection and immunity must be refined 
as shown in Figure 1. Contact means a behavior that allows the transmission of a 
computer virus to another computer, such as the distribution of a diskette or sending
a document (e.g., Microsoft Word) that contains a macro virus in the document’s
metadata. The intermediate copies of viruses, such as those on removable media like 
diskettes, are not explicitly counted. There may be some computers that have no 
defenses against the computer virus and are of course considered susceptible. 
However, some computers have defenses that are effective some of the time but not 
all of the time. For example, if antiviral software is effective but requires manual 
operation to scan potentially infected files, the degree of protection depends on the 
diligence of the user. These computers are classified as susceptibles that are 
partially immune and would be expected to have less exposure than a fully
susceptible computer. As a special case, these computers may be totally immune 
and never allow infection. A computer virus infection terminates when it is 
detected, where the detected class is analogous to the recovered class in the SIR 
model. If the computer is fully susceptible, detection occurs when the user 
recognizes symptoms of the virus infection. If the computer is partially immune, 
detection is likely to occur through operation of antiviral software, although 
recognition of the symptoms of infection may also play a role. It is expected that 
the duration of infection among the partially immune computers will be 
considerably shortened because of the use of antiviral software. Detection of any 
infected computer is assumed to generate action that will halt the outbreak. 

The model is based on a discrete time simulation having a daily time step. The 
actual construction of the model is done with the simulation package STELLA 
Research from High Performance Systems, Inc. STELLA Research is identical to 
HPS’s iThink. The number of infections generated per time step depends on factors 
shown in Table 1 (see also Figure 1). The daily contact rate is the average number 
of contacts per day that could allow transmission of a computer virus from one 
computer to another in an outbreak. The fraction infective is the fraction of the 
computers that are already infective, aggregating computers with no immunity and
with partial immunity. The checking frequency accounts for the fact that
susceptible computers with partial immunity have reduced exposure depending on
the likelihood that the software is actually used to check potential sources of
infection when it is supposed to be used. The product, (X1)(C)(F), is the average
number of infections generated per time step among susceptibles with no immunity.
The product, (X2)(C)(F)(1 - K), is the average number of new infections generated
per time step among susceptibles with partial immunity. The STELLA software 
package is used to construct a stochastic formulation by assigning the formula as the 
mean of a Poisson distribution. 

Table 1 States and parameters of the model

Symbol   Description

X1       Number of susceptibles with no immunity
X2       Number of susceptibles with partial immunity
Y1       Number of infectives with no immunity
Y2       Number of infectives with partial immunity
D1       Duration of infection with no immunity
D2       Duration of infection with partial immunity
M        Fraction partially immune
C        Daily contact rate
F        Fraction infective
K        Checking frequency
P        Compliance frequency

The number of infections detected per time step also depends on factors shown in 
Table 1. The primary focus is on when the user detects problems in his or her own 
computer. The duration of infection corresponds to the time interval starting with 
the infection of a computer and terminating with the detection of the infection on 
that computer by the user. The average duration of infection would be expected to 
be rather short with partial immunity afforded by antiviral software and rather long 
when the user must recognize symptoms. If the duration of infection is a single day, 
then all infected computers are transferred to the detected class in one time step. 
Using the single day as a minimum duration, infections of longer duration result in 
a fraction of the infective computers being detected each day, corresponding to the 
inverse of the duration of infection after the first day is subtracted off. For 
infections with duration longer than a day, the STELLA software package is used to 
construct a stochastic formulation by assigning the ratio, (Y1)/(D1 - 1), for
computers without immunity and the ratio, (Y2)/(D2 - 1), for computers with partial 
immunity, as the mean of a Poisson distribution for the number of computers 
detected per time step. A more accurate distribution would be the Binomial, but a 
built-in variate with this distribution is not available in STELLA, so we have
approximated the distribution with the Poisson. The Poisson distribution is a 
reasonable approximation to the Binomial when the daily probability of detection of 
an individual infected computer is low (20% or less) and the mean duration of 
infection is reasonably long (at least 5 days). In many computing environments, it 
is expected that the duration of infection is either a single day or several days long. 
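
The quality of the Poisson approximation can be checked numerically. The sketch below is only an illustration: the group of 10 infected computers and the helper names are assumptions, not part of the STELLA model. It compares the Binomial(n, p) distribution of daily detections with a Poisson distribution of the same mean, using the daily detection probability p = 1/(D1 - 1) = 1/9 implied by a 10-day mean duration, and contrasts it with a much higher daily probability of 50%.

```python
import math

def binomial_pmf(k, n, p):
    """Probability of exactly k detections among n infected computers."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k, lam):
    """Poisson probability of k events with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def max_pmf_gap(n, p):
    """Largest absolute gap between Binomial(n, p) and Poisson(n*p) for k = 0..n."""
    lam = n * p
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(n + 1))

# Hypothetical illustration: 10 infected computers.
gap_low = max_pmf_gap(10, 1 / 9)   # daily detection probability below 20%
gap_high = max_pmf_gap(10, 0.5)    # daily detection probability far above 20%
print(f"p = 1/9: max gap {gap_low:.3f};  p = 0.5: max gap {gap_high:.3f}")
```

The gap is a few percentage points at most when the daily probability is low and grows noticeably when it is high, which is consistent with the conditions stated above.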

An additional process of notification may occur if users of uninfected computers 
detect the presence of a computer virus in someone else’s computer because 
transmission of the computer virus was detected. This scenario assumes that 
susceptibles with partial immunity may detect transmission and then report the 
problem, thereby activating the same response as if the user of the infected 
computer had reported the problem. The number of infections notified in this way 
per time step depends on the compliance frequency, which represents the chance 
that a user who detects such an infection reports it properly. The product, 
(X2)(C)(F)(K)(P), is the average number of new infections notified per time step by 
susceptibles with partial immunity. The STELLA software package is used to 
construct a stochastic formulation by assigning the formula as the mean of a Poisson 
distribution. 

Scenarios are defined by the selection of particular values of the parameters as 
shown in Table 2. In comparison to the baseline scenario, the coverage scenario 
increases the percentage of computers with antiviral software from 80% to 100%, 
the efficacy scenario increases the checking frequency from 80% to 100% and the 
notification scenario increases the compliance frequency from 0% to 50%. These 
three scenarios correspond to infection control strategies of improving 
immunization coverage, improving the efficacy of a vaccine, and improving early 
notification of infection, respectively. The last scenario for an undetectable 
computer virus corresponds to the situation when installed antiviral software is 
ineffective even when properly used. 

The outbreak begins in a population of 100 computers with a single infective 
computer that may have no immunity or partial immunity. Each combination was 
iterated 100 times with a daily contact rate of 4, a duration of infection of a single 
day if a computer has partial immunity and a mean duration of infection of 10 days 
if a computer has no immunity. The mean, minimum and maximum of the outbreak 
sizes (including the initial infective computer) are shown in Table 2 corresponding 
to the two different starting conditions. The minimum and maximum are shown in 
brackets following the mean outbreak size. The most obvious effect, of course, is 
that an absence of any effective antiviral software guarantees extensive spread of a 
computer virus once introduced. However, it is less obvious that transmission is 
reduced more by improving the distribution (coverage) of antiviral software than by 
improving the efficacy of antiviral software. Even more remarkable is the effect of a 
notification policy on reducing the spread of a computer virus without any 
improvements in the technology. 
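
The model described in this Appendix can be approximated outside STELLA. The following is a minimal sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, Poisson draws use Knuth's multiplication method, all events in a time step are computed from the state at the start of the step, and the outbreak is taken to halt after the first step in which any detection or notification occurs. It should not be expected to reproduce the published outbreak sizes exactly.

```python
import math
import random

def poisson(rng, lam):
    """Draw a Poisson variate with mean lam (Knuth's method; fine for small means)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_outbreak(M, K, P, start="none", N=100, C=4, D1=10, D2=1, seed=None):
    """Return (outbreak size, days) for one run; symbols follow Table 1."""
    rng = random.Random(seed)
    n_partial = round(N * M)
    X1, X2, Y1, Y2 = N - n_partial, n_partial, 0, 0
    if start == "none":
        if X1 == 0:
            raise ValueError("no computers without immunity in this scenario")
        X1, Y1 = X1 - 1, 1
    else:
        if X2 == 0:
            raise ValueError("no partially immune computers in this scenario")
        X2, Y2 = X2 - 1, 1
    size, days = 1, 0
    while True:
        days += 1
        F = (Y1 + Y2) / N                                   # fraction infective
        new1 = min(X1, poisson(rng, X1 * C * F))            # no immunity
        new2 = min(X2, poisson(rng, X2 * C * F * (1 - K)))  # partial immunity
        notified = poisson(rng, X2 * C * F * K * P)         # reports by contacts
        # A one-day duration means every infective of that type is detected.
        det1 = Y1 if D1 <= 1 else min(Y1, poisson(rng, Y1 / (D1 - 1)))
        det2 = Y2 if D2 <= 1 else min(Y2, poisson(rng, Y2 / (D2 - 1)))
        X1, X2 = X1 - new1, X2 - new2
        Y1, Y2 = Y1 + new1, Y2 + new2
        size += new1 + new2
        # Assumed halting rule: any detection or notification ends the outbreak.
        if det1 + det2 + notified > 0 or X1 + X2 == 0:
            return size, days

if __name__ == "__main__":
    runs = [simulate_outbreak(0.8, 0.8, 0.0, seed=s)[0] for s in range(100)]
    print("baseline, start with no immunity:",
          sum(runs) / len(runs), min(runs), max(runs))
```

Calling `simulate_outbreak(0.8, 0.8, 0.5)` corresponds to the notification scenario, and `simulate_outbreak(0.0, 0.0, 0.0)` to the undetectable scenario.
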

Table 2 Mean, minimum and maximum outbreak sizes for different scenarios

Scenario      Fraction   Checking   Compliance  Start with     Start with
              partially  frequency  frequency   no immunity    partial
              immune                                           immunity

Baseline      0.80       0.80       0           9.17 [2,34]    2.48 [1,6]
Coverage      1.00       0.80       0           N/A            1.84 [1,5]
Efficacy      0.80       1.00       0           7.86 [1,20]    N/A
Notification  0.80       0.80       0.50        3.59 [1,15]    2.46 [1,7]
Undetectable  0          0          0           53.16 [3,100]  N/A



In order to better understand the effects of these strategies, a useful theoretical 
construct is the basic reproduction ratio, which represents the potential spread of 
infection from a single infected individual. The basic reproduction ratio 
characterizes threshold behavior of the dynamics of transmission. If the ratio is less 
than unity, then the infection does not spread. If the ratio is greater than unity, the 
infection does spread. Larger values of the ratio indicate a greater amount of 
spread. 

Central to the construct of the basic reproduction ratio is the classification of 
random contacts at the initiation of an outbreak according to the immune state of 
the computer contacted, as shown in Table 3. The approximate
frequencies of each type of contact for the baseline scenario and the derivations are 
also shown in Table 3 (see also Table 1). The exact frequencies are slightly different 
because the first computer infected is removed from the susceptible population; the 
actual adjustment depends on the type of the first computer to be infected. As an 
outbreak progresses, the exact frequencies continue to change as the composition of 
the susceptible population changes. 

Table 3 Frequencies of type of computer contact for baseline scenario 



Type of computer contact                  Frequency   Derivation

No antiviral software                     20%         1 - M
Antiviral software not used effectively   16%         (M)(1 - K)
Antiviral software used effectively       64%         (M)(K)



The basic reproduction ratio is the product of contacts per day, the duration of 
infection, and the fraction of contacts that are susceptible as shown in Table 4. 
Note that, unlike the outbreak sizes shown in Table 2, the basic reproduction ratio 
does not include the initial infective computer. For all scenarios, the contact rate is 
the parameter described earlier. The other factors in the basic reproduction ratio 
require additional explanation. For each scenario, the fraction of contacts that is 
susceptible is derived from summing of the frequencies of the two types of contact
shown in Table 3 that lead to infection. For the baseline scenario this is 20% + 
16%. The duration of infection is a weighted average of the mean duration of 
infection in a computer with no immunity (assumed to have a mean of 10 days) and 
the duration of infection in a computer with partial immunity (assumed to be a 
single day), where the weights are the relative frequencies of an infection in each 
category. For the baseline, among contacts that lead to infection, 56% (20/36) is the 
chance that a computer has no immunity and 44% (16/36) is the chance that a 
computer has partial immunity. So the duration of infection is shown in Table 4 as 
(56%)(10) + (44%)(1). The ratio of 8.7 means that an average of 8.7 infected 
computers would be expected from an initial infected computer. The coverage 
scenario drastically reduces the basic reproduction ratio to 0.8 by reducing the 
fraction of contacts that are susceptible and the typical duration of infection. In 
contrast, the efficacy scenario has only a modest effect on the basic reproduction 
ratio because computers without any antiviral protection continue to transmit for 
several days if infected. The effect of the notification policy comes entirely from 
changes in the duration of infection because infection is detected earlier if the initial 
infection is in a computer without immunity (the duration of infection in a computer 
with partial immunity cannot be reduced to less than a day). The reduction of the 
mean duration of infection from 10 days to 1.4 days for a computer without 
immunity is calculated from a Poisson process whose mean of 1.28 is the product of 
three factors: 4 contacts per day; 64% (the chance that a contact detects the 
infection); and 50% (the chance that someone who detects the infection notifies 
system administrators). The probability of no one notifying in a day is exp(-1.28),
i.e. 28%, so that the daily probability of notification if there is an infected computer 
without immunity is 72% and the mean time till detection is 1/.72 = 1.4 days. For 
the notification scenario, the overall mean duration of 1.2 days is the weighted 
average allowing for infected computers with no immunity or partial immunity, 
where the weights are exactly the same as used in the baseline scenario. 
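
The arithmetic above can be retraced in a few lines. This is a sketch rather than the authors' calculation: the function name is hypothetical, and capping the notification-adjusted duration at D1 is an added assumption. With unrounded weights the baseline comes out at 8.64 rather than 8.7 (the text rounds the weights to 56% and 44%), and the notification scenario at about 1.75 rather than 1.7.

```python
import math

def basic_reproduction_ratio(M, K, P=0.0, C=4, D1=10.0, D2=1.0):
    """Contacts per day x mean duration of infection x fraction susceptible."""
    f_none = 1 - M            # contact has no antiviral software
    f_unused = M * (1 - K)    # antiviral software present but not used effectively
    susceptible = f_none + f_unused
    if susceptible == 0:
        return 0.0            # every contact is protected
    # Notification: contacts that detect the infection (M*K) and report it (P)
    # arrive as a Poisson process with this daily mean.
    rate = C * M * K * P
    if rate > 0:
        # Mean time until the first notification, capped at D1 (assumption).
        D1 = min(D1, 1 / (1 - math.exp(-rate)))
    # Duration weighted by the share of infections in each category.
    duration = (f_none * D1 + f_unused * D2) / susceptible
    return C * duration * susceptible

scenarios = {
    "Baseline":     dict(M=0.8, K=0.8),
    "Coverage":     dict(M=1.0, K=0.8),
    "Efficacy":     dict(M=0.8, K=1.0),
    "Notification": dict(M=0.8, K=0.8, P=0.5),
    "Undetectable": dict(M=0.0, K=0.0),
}
for name, params in scenarios.items():
    print(f"{name:13s} {basic_reproduction_ratio(**params):5.2f}")
```

Only the coverage scenario falls below the threshold of unity, in agreement with the discussion of threshold behavior.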

Table 4 Basic reproduction ratios for different scenarios

Scenario      Daily    Duration of infection       Fraction of  Basic
              contact  (days)                      contacts     reproduction
              rate                                 susceptible  ratio

Baseline      4        6.04 (10 x .56 + 1 x .44)   0.36         8.7
Coverage      4        1.0                         0.20         0.8
Efficacy      4        10.0                        0.20         8.0
Notification  4        1.2 (1.4 x .56 + 1 x .44)   0.36         1.7
Undetectable  4        10.0                        1.00         40.0



[Figure 1 labels: DETECTED COMPUTERS; TERMINATION OF OUTBREAK IF USERS REPORT TO SYSTEM ADMINISTRATORS WHO RESPOND IMMEDIATELY; ORGANIZATIONAL ENVIRONMENT]

Figure 1 Epidemiological states of computer virus infection.




Integrity Control of Spreadsheets: 
Organisation & Tools 



K. Rajalingham, D. Chadwick 

Information Integrity Research Group

School of Computing & Mathematical Sciences, University of 
Greenwich, 

Wellington Street, Woolwich, London SE18 6PF, 

United Kingdom 
Phone: +44 (0)181 331 8510 
Fax: +44 (0)181 331 8665
Email: rk09@gre.ac.uk 



Abstract 

This paper describes a new approach to the provision of a software engineering 
discipline for the spreadsheet life-cycle: from requirements, to design, 
implementation and subsequent maintenance. This approach addresses the 
widespread problem of spreadsheet errors, which are analysed in detail and the 
outcome of which is a more comprehensive classification of errors than presented 
before. It is accompanied and supported by appropriate examples. 

This new approach differs from other approaches in that it is based on an analysis 
of spreadsheet structure. The elements of this approach are explained and 
illustrated with examples. Its potential for integrity control is also discussed,
especially based on the types of errors it will detect, reduce or prevent. 



1 INTRODUCTION

1.1 The Problem and its Magnitude 

Numerous recent publications have indicated the seriousness of spreadsheet errors 
and their adverse impact or potential impact on businesses. 

A recent financial model review by KPMG Management Consulting, London 
(KPMG, 1997) confirms the frequency and seriousness of spreadsheet errors. Their 
report states that in 95% of the financial models reviewed, at least 5 errors were 
found. The review also reveals alarming statistics pertaining to defects in 
spreadsheet development, addressing the project management, technical and 
analysis aspects. As a result of these findings, KPMG Management Consulting 
(London) has engaged in a collaboration with the Information Integrity Research 
Group of the University of Greenwich on research into integrity control in
spreadsheets. This paper is part of that research. 

An article in New Scientist (Ward, 1997) has reported that a study conducted by 
the British accounting firm Coopers & Lybrand found errors in 90% of the 
spreadsheets audited. This is an extremely high figure and, had the errors gone
undetected, they could have had a devastating effect on the business. The same source
has also indicated that a decade’s worth of research findings of Professor Raymond 
Panko at the University of Hawaii revealed that spreadsheets had a dangerously 
high rate of errors. 

There are also publications from more than a decade ago that clearly state that 
spreadsheet errors have caused serious disruption of business. Although these cases 
are not based on formal research, they do show that spreadsheet errors were 
considered important enough to be reported in the general business and computing 
press. For instance, according to an article in Personal Computing (Ditlea, 1987, 
January), a Houston consultant with Price Waterhouse had found 128 errors in 4 
spreadsheet models that had already been in use for months. 

These reports indicate that the occurrence of spreadsheet errors is a major problem 
for businesses and needs to be addressed urgently. 

1.2 The Need for a New Approach to the Discipline for 
Spreadsheet Development 

Findings from research carried out over the last few years show the need for a new 
approach or discipline for spreadsheet development. This is evident from the call, 
or implied call, for a new approach in many recent publications. Panko and 
Halverson (1996, January) distinctly state the need to adopt strict programming
disciplines in dealing with complex spreadsheets. 

The journal paper by Panko (Panko, forthcoming) says that there is an obvious 
need to begin adopting traditional programming disciplines due to the similarity 
between spreadsheet errors and programming errors. The paper also states that 
there is far too little knowledge of spreadsheet errors, which implies that much 
more research has to be undertaken into spreadsheet errors. 

It is stated by Hendry and Green (1994, January) that instead of creating the whole 
spreadsheet first and then checking for errors, errors ought to be checked for at 
various stages of the development process. This will make it easier to trace and 
correct errors. This strategy of stage-by-stage component testing is a software 
engineering principle. 

Based on these published reports, we can come to the conclusion that there is a 
need for the imposition of a programming or software engineering-based discipline 
in spreadsheet development. This will help address the currently major problem of 
spreadsheet errors. 



2 CLASSIFICATION OF SPREADSHEET ERRORS 

A thorough review of literature relevant to spreadsheet development and errors 
reveals that very little research has been done on studying specific errors that occur 
in spreadsheets. One such study was undertaken by Chadwick et al (1997,
December). 

This paper presents a more comprehensive classification of spreadsheet errors as 
shown in Figure 1. This classification is illustrated with examples of specific
errors. This is based on work already done as well as original work produced by 
the authors of this paper. 

In the broadest terms, spreadsheet errors can be divided into two major categories, 
namely software errors and user errors. 

Software errors 

Software errors are errors made by the spreadsheet software and therefore their 
occurrence is generally beyond the control of users, although they can, when 
aware, take remedial action. 




[Figure 1 Spreadsheet Errors]

Example: Year 2000 Error

In MS Excel 5 for instance, for any entry of a date (without the century) before 
01/01/20, the century is assumed to be the 21st century while for any entry of a 
date (without the century) after 01/01/20, the century is assumed to be the 20th 
century. This problem, of course, can be avoided if the year is explicitly entered 
with the century e.g. 09/02/1915, 03/12/2060. 
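The windowing rule described above can be sketched in a few lines of Python. This is only an illustration of the behaviour as stated in the text, not a reproduction of the spreadsheet software: the function names, the day-first date layout and the treatment of the year "20" itself (mapped here to 1920) are assumptions.

```python
from datetime import date


def expand_two_digit_year(yy: int, pivot: int = 20) -> int:
    """Expand a two-digit year with a fixed pivot window.

    Years below the pivot become 20yy, all others 19yy. The pivot of 20
    follows the behaviour described in the text; placing 20 itself in the
    20th century is an assumption made for this sketch.
    """
    if not 0 <= yy <= 99:
        raise ValueError("expected a two-digit year")
    return 2000 + yy if yy < pivot else 1900 + yy


def parse_short_date(text: str, pivot: int = 20) -> date:
    """Parse 'dd/mm/yy' or 'dd/mm/yyyy' (day-first order is assumed)."""
    day, month, year = text.split("/")
    if len(year) == 4:
        full_year = int(year)  # century given explicitly: nothing to guess
    else:
        full_year = expand_two_digit_year(int(year), pivot)
    return date(full_year, int(month), int(day))
```

With this sketch, `parse_short_date("09/02/15")` lands in 2015 even if 1915 was intended, whereas `parse_short_date("09/02/1915")` is unambiguous, which is the remedy the text recommends.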

User errors 

User errors are errors committed by the user and can be prevented, detected and 
corrected. They can be divided into two major categories at the highest level, 
namely qualitative errors and quantitative errors. This is based on Ray Panko's 
framework for classifying spreadsheet errors (Panko & Halverson, 1996, January). 

The spreadsheet model below has been adapted from the one illustrated in the 
paper by Chadwick et al (Chadwick et al, 1997, December). Most of the errors 
explained in this section will be based on this model. 



      C              D              E              F              G              H

                     Number         Day            Night          Total          Average
                     of Staff       Wages £        Wages £        Wages £        Wage £
 6    Grade 1        1              17700.50       0.00           =SUM(E6:F6)    =G6/D6
 7    Grade 2        3              45540.00       1400.55        =SUM(E7:F7)    =G7/D7
 8    Grade 3        9              122340.00      2000.00        =SUM(E8:F8)    =G8/D8
 9    Grade 4        12             102350.25      0.00           =SUM(E9:F9)    =G9/D9
10    Grand Total    =SUM(D6:D9)    =SUM(E6:E9)    =SUM(F6:F9)    =SUM(G6:G9)    =G10/D10


      C              D              E              F              G              H

                     Number         Day            Night          Total          Average
                     of Staff       Wages £        Wages £        Wages £        Wage £
 6    Grade 1        1              17700.50       0.00           17700.50       17700.50
 7    Grade 2        3              45540.00       1400.55        46940.55       15646.85
 8    Grade 3        9              122340.00      2000.00        124340.00      13815.56
 9    Grade 4        12             102350.25      0.00           102350.25      8529.19
10    Grand Total    25             287930.75      3400.55        291331.30      11653.25

(Output)

Figure 2 Staff Budget Costs 1995-1996
Qualitative errors

Qualitative errors are errors that do not immediately produce incorrect numeric 
values on the spreadsheet itself. However, they do reduce the quality of the 
spreadsheet produced as it becomes prone to misinterpretation on the part of the 
user. As a result, it also becomes more difficult to update and maintain the model. 
A more detailed investigation into qualitative errors revealed that they can be 
generally divided into two different types, namely, formatting errors and 
decision errors. 

Formatting errors 

Formatting errors are qualitative errors that occur due to lack of uniformity in the 
formatting of similar data. This could lead to an incorrect interpretation of their 
values. An example of a formatting error is the Money Format: Commas, 
Standard, Decimalisation error. 

Example 1: Money Format: Commas, Standard, Decimalisation

This error is mentioned with an example by Chadwick et al (1997, December) 
under Errors in 'Planning Skills'. It is a common qualitative error where the cell 
format is specified as General on the spreadsheet. Consequently, the figures have 
varying decimal places and make it difficult to identify a number that is incorrect 
by a magnitude of 10, 100 etc. 

Decision errors 

Decision errors are qualitative errors that occur due to an incorrect decision, choice 
or assumption that leads to the intentional entry or referencing of an inappropriate 
piece of data, especially numeric. As far as qualitative errors are concerned, 
decision errors are far more difficult to detect compared to formatting errors. 

Example 2: Qualitative Error Resulting from the Referencing of Non-current Data

This is an example of a qualitative error produced by referencing a piece of data
that has become invalid through the passage of time. In the example given below
(Figure 3), this piece of data is the exchange rate from British Pounds (£) to
Ringgit Malaysia (RM) contained in cell D2. If the exchange rate fluctuates
sharply and the changes are not reflected in cell D2, the calculation in cell A8
produces a value that is invalid. This is a qualitative error, and any decision
based on this value would be unreliable.
     A                      B           C            D

1                           Tea (£)     Milk (£)     Exchange Rate
                                                     (£ to RM)
2    1994                   450         560          7.3
3    1995                   904         900
4    1996                   872         800
5    1997                   123         234
6
7    Total Sale of Tea
     & Milk in RM
8    35354

Figure 3 Qualitative, decision error.

Example 3: Hard-coding of a Formula 

The hard-coding of a formula is another example of a qualitative, decision error. 
This error decreases the quality of the spreadsheet by making it much less flexible. 
Referring to Figure 2, if the formulae in column H were hard-coded e.g. =G8/9 (in 
cell H8) instead of =G8/D8, and if any of the values in column D (number of staff) 
changed, the formula in column H of the same row would have to be re-written. 
This is just a simple example to illustrate the concept of hard-coding being a 
source of error. 
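The same point can be illustrated outside the spreadsheet. The following minimal Python sketch contrasts a hard-coded divisor with a referenced one, using the Grade 3 figures from Figure 2; the function names are hypothetical and chosen only for this illustration.

```python
def average_hard_coded(total_wages: float) -> float:
    """Mirrors =G8/9: the number of staff is frozen inside the formula."""
    return total_wages / 9


def average_referenced(total_wages: float, number_of_staff: int) -> float:
    """Mirrors =G8/D8: the number of staff is read from its own cell."""
    return total_wages / number_of_staff


total_grade_3 = 124340.00   # cell G8 in Figure 2
staff_grade_3 = 10          # cell D8 after one more member of staff is added

stale = average_hard_coded(total_grade_3)                 # still divides by 9
fresh = average_referenced(total_grade_3, staff_grade_3)  # follows the change
```

After the change to the staff count, only the referenced version reflects the new divisor; the hard-coded one silently keeps the old average.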

Quantitative errors 

Quantitative errors, in turn, are numerical errors that lead to incorrect bottom-line 
values. This definition is offered by Panko and Halverson (1996, January) who 
also state that a simple trichotomy captures the diversity of quantitative errors 
reasonably well. They are mechanical, logic and omission errors. 

Mechanical errors 

Mechanical errors are simply typing errors. Though quite frequently occurring, 
they have a high chance of being spotted and corrected immediately by the person 
committing the error. Some, however, do go undetected and could lead to incorrect 
values in other cells. Mechanical errors can be divided into two distinct categories. 
They are overwriting errors and data input errors. 

Overwriting errors 

An overwriting error is said to have occurred when a correct piece of data or
formula is unintentionally replaced or overwritten by an incorrect piece of data.
This error is also mentioned under 'Errors in Enabling Skills' by Chadwick et al
(1997, December). There are two types of overwriting errors: overwriting of
unreferenced data and overwriting of referenced data. In each case the
overwritten item can be either a formula or a directly entered value.
Data input errors

These are errors made by users while entering data into the spreadsheet. Just like
overwriting errors, there are two types of data input errors: error in input of
unreferenced data and error in input of referenced data. Unlike the former, the
latter produces an incorrect piece of data in a cell which is subsequently referenced
by at least one formula. Consequently, the values returned to the formula cell(s)
are also incorrect. Each of these incorrect entries can be either a formula or a
directly entered value. When incorrect cell or range addresses are carelessly
entered into formulae, the formulae themselves produce incorrect return values,
for instance by referencing blank cells or non-numeric cells.

Logic errors 

Panko and Halverson (1996, January) define logic errors as incorrect formulae due
to choosing the wrong algorithm or creating the wrong formulae to implement the
algorithm. They involve entering the wrong formula because of a mistake in
reasoning. Logic error rates are higher than mechanical error rates (Panko,
forthcoming). Although there are several ways of subdividing logic errors, we use
the classification of spreadsheet errors into 'Errors in Enabling Skills' and
'Errors in Planning Skills' as proposed by Chadwick et al (1997, December).
However, its use will be confined to quantitative, logic errors only.

Logic errors in 'enabling skills' 

Enabling skills are those needed to permit the user full use of the functions and 
capabilities of the particular spreadsheet package in use, with an understanding of 
the spreadsheet principles, concepts, constructs, reserved words and syntax. This 
definition has been adapted from that given by Chadwick et al (1997, December). 
The two errors in enabling skills given below are taken from the same paper. 

Example 4: Relative and Absolute Copy Problem 

A relative copy causes the cell references in a copied formula to shift their row
and column references relative to the original cell. People often make the false
assumption that the software will automatically adapt the cell references
correctly wherever the formula happens to be copied.

Example 5: Circular References 

This error typically arises in totals, where the formula uses its own value in its
calculation. Because it produces a run-time error message, it probably goes
uncorrected only infrequently. With reference to Figure 2, an example of this error
would be the entry of the formula =SUM(D6:D10) into cell D10, instead of
=SUM(D6:D9).
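A circular reference of this kind can be detected mechanically by following each formula's precedents until the starting cell is reached again. The sketch below is one possible way to do this in Python; it is not how any particular spreadsheet package implements the check, and the small parser handles only single-letter column addresses and simple ranges, which is an assumption made to keep the example short.

```python
import re

RANGE = re.compile(r"\b([A-Z])([0-9]+):([A-Z])([0-9]+)\b")
CELL = re.compile(r"\b([A-Z])([0-9]+)\b")


def precedents(formula: str) -> set[str]:
    """Return the cells a formula references (ranges are expanded)."""
    cells = set()
    for c1, r1, c2, r2 in RANGE.findall(formula):
        for col in range(ord(c1), ord(c2) + 1):
            for row in range(int(r1), int(r2) + 1):
                cells.add(f"{chr(col)}{row}")
    remainder = RANGE.sub("", formula)  # avoid counting range ends twice
    cells.update(col + row for col, row in CELL.findall(remainder))
    return cells


def is_circular(cell: str, formulas: dict[str, str]) -> bool:
    """True if `cell` depends, directly or indirectly, on itself."""
    pending = list(precedents(formulas[cell]))
    seen = set()
    while pending:
        current = pending.pop()
        if current == cell:
            return True
        if current in seen or current not in formulas:
            continue  # already explored, or a plain value cell
        seen.add(current)
        pending.extend(precedents(formulas[current]))
    return False


wrong = {"D10": "=SUM(D6:D10)"}   # the erroneous entry from the text
right = {"D10": "=SUM(D6:D9)"}    # the intended formula
```

Here `is_circular("D10", wrong)` is true because the range D6:D10 includes D10 itself, while the intended formula passes the check.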
Logic errors in 'planning skills'

Planning skills are those required to analyse the business function in order to 
design the data model which is to be represented electronically by the spreadsheet 
model. These skills enable the user to identify business functions which are 
suitable for modelling with a spreadsheet and how this modelling is to be done. 
This requires thorough knowledge of business functionality and requirements for 
both the present and the future. This explanation of planning skills has been 
adapted from that given by Chadwick et al (1997, December). The three errors in 
planning skills given below are taken from the same paper. 

Example 6: The TOTALS Problem 

The error is said to have occurred when the column total and the row total differ
when they should logically produce the same result, often because of a lack of
cross-checking. For instance, referring to Figure 2, the value in cell G10 should
be the same whether the formula is entered as =SUM(G6:G9), which is the column
total, or =SUM(E10:F10), the row total. This error can be easily picked up if the
formula =IF(SUM(G6:G9)<>SUM(E10:F10), "Error", SUM(G6:G9)) is entered in cell
G10. An error message would then be displayed in the cell if the column and row
totals differed.
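The cross-check performed by this IF formula can be sketched in Python as follows. The function name is hypothetical, and the small tolerance is a design choice of this sketch (the spreadsheet formula compares the two totals exactly); it is there only so that floating-point rounding does not raise a false alarm.

```python
import math


def checked_grand_total(column_values, row_values, tolerance=0.005):
    """Return the column total, or "Error" if it disagrees with the row total."""
    column_total = sum(column_values)
    row_total = sum(row_values)
    if not math.isclose(column_total, row_total, abs_tol=tolerance):
        return "Error"
    return column_total


# Values from the output view of Figure 2.
total_wages_by_grade = [17700.50, 46940.55, 124340.00, 102350.25]  # G6:G9
grand_totals_day_night = [287930.75, 3400.55]                      # E10:F10

result = checked_grand_total(total_wages_by_grade, grand_totals_day_night)
```

If one of the wage cells were mistyped, the two totals would no longer agree and the function would return "Error" instead of a number.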

Example 7: AVERAGE Problem

This happens when the average function AVERAGE(Rg) is applied incorrectly due 
to little understanding of its appropriateness. An example of this type of error 
would be the entry of the formula =AVERAGE(E6:F6) in cell H6, instead of 
=G6/D6, on the spreadsheet shown in Figure 2. 
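To see why the two formulae differ, the sketch below works through the Grade 2 row of Figure 2 in Python. The function names are hypothetical; the first mirrors the misapplied AVERAGE over the two wage columns, the second the intended total divided by the number of staff.

```python
from statistics import mean


def average_of_wage_columns(day_wages: float, night_wages: float) -> float:
    """Mirrors =AVERAGE(E7:F7): the mean of two column values."""
    return mean([day_wages, night_wages])


def average_wage_per_staff(day_wages: float, night_wages: float,
                           number_of_staff: int) -> float:
    """Mirrors =G7/D7: total wages divided by the number of staff."""
    return (day_wages + night_wages) / number_of_staff


misapplied = average_of_wage_columns(45540.00, 1400.55)
intended = average_wage_per_staff(45540.00, 1400.55, 3)
```

The misapplied version divides by the number of wage columns (two) rather than by the number of staff (three), so it answers a different question from the one the Average Wage column is meant to answer.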

Example 8: Percentage Problem 

This error occurs when the formula to calculate a percentage is incorrectly
written, owing to a lack of knowledge either of what a percentage is or of BODMAS
(Brackets, Of, Division, Multiplication, Addition, Subtraction), the rule by which
the spreadsheet identifies precedence in calculations, e.g. B2/A2*100, B2*100/A2
or B2*A2/100 instead of A2/B2*100 or A2*100/B2. This is based on Figure 4 below.



     A                  B                  C

1    Night Wages £      Total Wages £      Night Wages %
2    1400.00            46940.00

Figure 4 Percentage problem.
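Using the values in Figure 4, the correct and incorrect variants can be compared directly. The following Python sketch is illustrative only; the function name is hypothetical.

```python
def night_wage_percentage(night_wages: float, total_wages: float) -> float:
    """Night wages as a percentage of total wages (A2/B2*100 in Figure 4)."""
    if total_wages == 0:
        raise ValueError("total wages must be non-zero")
    return night_wages / total_wages * 100


night, total = 1400.00, 46940.00     # cells A2 and B2

correct = night_wage_percentage(night, total)   # A2/B2*100
inverted = total / night * 100                  # B2/A2*100: operands swapped
misplaced = total * night / 100                 # B2*A2/100: wrong operation
```

A share of a total can never exceed 100%, so both incorrect variants fail a simple range check that the correct formula passes.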

Omission errors 

Omissions are things left out of the model that should be there. They often result
from a misinterpretation of the situation. Human factors research has shown that
omission errors are especially dangerous, because they have low detection rates
(Panko & Halverson, 1996).



3 EXISTING TOOLS AND METHODS FOR INTEGRITY 
CONTROL IN SPREADSHEETS 

If we apply the generic information systems development life cycle to spreadsheets,
then, referring to section 2, we find that the different types of errors appear at
different stages of the spreadsheet life cycle. There are various existing tools
to help with integrity control at these different stages. Among them are the
Microsoft Excel audit tool (Microsoft Corporation, 1994), the Spreadsheet
Professional audit tool for Microsoft Excel by Spreadsheet Innovations, the
DiAntonio method (DiAntonio, 1994) and Chadwick et al's 5-step methodology
incorporating the 3A's approach (Chadwick et al, 1997, December). This is
illustrated in Figure 5.



[Figure 5 (diagram not reproduced): the error types (qualitative, logic, enabling,
mechanical and omission errors) are set against the methods and tools for
integrity control (the DiAntonio method, Chadwick et al's 3A's method, the MS
Excel audit tool and Spreadsheet Professional).]

Figure 5 Errors and methods/tools associated with different stages of the
spreadsheet life cycle.



Though these tools have to an extent reduced errors in spreadsheets, they have not 
been entirely successful as the phenomenon still persists. 

Microsoft Excel Audit Tool 

This tool which is part of the Microsoft Excel software enables the user to easily 
trace the precedents or dependants of any cell. The precedents of a cell are the 
cells referenced by it while the dependants of a cell are the cells that reference it. 
When tracing the precedents of a cell, an arrowed line is drawn from each 
precedent cell, pointing to the dependant cell. On the other hand, when tracing the 
dependants of a cell, an arrowed line is drawn from that cell pointing to each of its 
dependant cells. An example is given by Chadwick et al (1997, December). 

E.g. Referring to Figure 2, the precedents of cell H6 are cells D6 and G6,
while the dependants of cell G6 are cells H6 and G10.
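The relationship between precedents and dependants is simply a mapping and its inverse. The Python sketch below makes this explicit for a few formula cells of Figure 2; the precedent sets are written out by hand, and the function name is hypothetical. It is not a description of how the audit tool itself works.

```python
from collections import defaultdict

# Precedents of some formula cells in Figure 2, listed by hand.
PRECEDENTS = {
    "G6": {"E6", "F6"},                 # =SUM(E6:F6)
    "H6": {"G6", "D6"},                 # =G6/D6
    "G10": {"G6", "G7", "G8", "G9"},    # =SUM(G6:G9)
    "H10": {"G10", "D10"},              # =G10/D10
}


def build_dependants(precedents: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert the precedent map: for each cell, which cells reference it."""
    dependants = defaultdict(set)
    for cell, referenced in precedents.items():
        for ref in referenced:
            dependants[ref].add(cell)
    return dict(dependants)


DEPENDANTS = build_dependants(PRECEDENTS)
```

Looking up `PRECEDENTS["H6"]` and `DEPENDANTS["G6"]` reproduces the example given in the text.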

Apart from that, the audit tool also offers a facility for the user to attach a note or 
description to a cell. 

Spreadsheet Professional for Microsoft Excel by Spreadsheet Innovations

This tool, which is an add-on to Microsoft Excel, has served as a rather useful
auditing tool. It has various functions to help detect errors in the spreadsheet
model. Among the significant functions are the calculation checker and the cell
translation facility.

Calculation Checker 

This function enables the user to view the contents, potential error and precedents
of a formula cell. Based on the spreadsheet model shown in Figure 2, if a
'calculation check' is done on cell H10 and the 'Show Precedents' option is
subsequently selected, the precedents of cell H10 are displayed in the following
style (note that this, however, does not show the relationship between them in the
same display i.e. =G10/D10):



Ref     Translation     Value

G10     Grade Total     291331.30

D10     Grade Total     25




The translation for each precedent (cell or range) is given in terms of location, 
corresponding row heading and values. 
Cell Translation

The result of a 'Cell Translation' for cell E10 would produce information as given
below:

Sheet1!E10  = SUM(E6:E9)

Grade Total = SUM(Grade 1:Grade 4)

287930.75   = SUM(17700.50:102350.25)




The formula in the cell is first translated into a form where each cell address is 
replaced with its corresponding row heading and then presented in a form where 
each cell address is replaced with its corresponding value. 
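The two substitutions described above can be sketched with a regular expression in Python. This is only an illustration of the idea, not the add-on's implementation: the function name is hypothetical, and the sketch assumes single-letter column addresses and that a row heading and a value are supplied for every cell the formula mentions.

```python
import re

CELL = re.compile(r"\b([A-Z])([0-9]+)\b")


def translate(formula: str, row_headings: dict[int, str],
              values: dict[str, float]) -> tuple[str, str]:
    """Return the formula with addresses replaced by headings, then by values."""
    by_heading = CELL.sub(lambda m: row_headings[int(m.group(2))], formula)
    by_value = CELL.sub(lambda m: f"{values[m.group(0)]:.2f}", formula)
    return by_heading, by_value


headings = {6: "Grade 1", 9: "Grade 4"}
cell_values = {"E6": 17700.50, "E9": 102350.25}

heading_form, value_form = translate("=SUM(E6:E9)", headings, cell_values)
```

For the formula in cell E10 this yields the heading form and the value form shown in the second and third lines of the translation.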

DiAntonio's Method for Spreadsheet Development

DiAntonio has proposed a structured method consisting of six distinct steps for the 
construction of spreadsheets. 

Step 1: The problem is understood and defined.

Step 2: Isolation of facts is done by splitting the spreadsheet into two parts, 
one for the facts and one for the solution. 

Step 3: The solution is formatted or designed and it uses data from the facts 
part of the spreadsheet. 

Step 4: The program is tested with sample data. 

Step 5: The program is evaluated in terms of functionality, headings, labels 
and format. 

Step 6: The program is documented either on the spreadsheet itself or in hard 
copy. 

Chadwick et al's 5-step Methodology Incorporating the 3A's Approach

Chadwick et al (1997, December) propose a five-step methodology for spreadsheet
auditing that incorporates the 3A's (appropriateness, accuracy, about-right)
approach. A brief outline of this methodology is given below.

Step 1: Checking the appropriateness of the formula applied, from a logical 
point of view, based on the underlying business model. 

Step 2: Checking the accuracy of the formula entered based on a correct 
interpretation of the data model. 

Step 3: Checking if the resulting numeric value of the cell is about right. 

Step 4: Validating a formula copy to a cell or a range. 

Step 5: Modularising the spreadsheet by breaking it down into separate logical 
areas (modules). 




4 A MODULAR APPROACH TO REDUCING ERRORS IN
SPREADSHEETS

4.1 Rationale for a Modular Approach

This approach applies the concept of modularisation to the design and structuring
of spreadsheets. Based on section 1.2, it is evident that there is a need for the
introduction of strict development disciplines in spreadsheeting, as well as the
adoption of traditional programming or software engineering-based principles,
concepts and practices. Modularisation is a fundamental principle of software
engineering. Support for the modular approach comes from DiAntonio (1986) and
Chadwick et al (1997, December), but it is weakly defined in both these sources
(refer to section 3 for a brief outline of both these methods).



[Figure 6 (diagram not reproduced): the error types are shown against the
life-cycle stages Requirements, Logical Model, Physical Model, Operation and
Review, all of which are spanned by the modular approach.]

Figure 6 The modular approach covers all stages of the spreadsheet life cycle.

Within the context of spreadsheeting, modularisation refers to the structuring of 
the spreadsheet model into distinct blocks or modules with data being passed 
between them. The elements of this approach are discussed in detail in the next 
section (4.2). 

Another fundamental principle of this modular approach is the concept of 
coupling. This is also a software engineering concept. Coupling may be defined 
simply as the link between modules resulting from the passing of data between 
them. This is also discussed in section 4.2. 
4.2 A Modular Structure for Spreadsheets

The modular approach dictates the division of the physical model (spreadsheet 
data) into distinct modules, with each cell value having a column heading and row 
heading associated with it. 

Most, if not all, models can, with minor adjustments, be made to conform to this
structure. Dividing the spreadsheet into separate blocks or modules in this way
reflects a modular approach based on an analysis of spreadsheet structure.

The term we have given to a distinct module of the business spreadsheet is an
extent. An extent can be defined as a matrix representing a logical area or module
of the spreadsheet. It is a range with the following special characteristics.

The minimum size of an extent is a 2 by 2 range (4 cells). The first column of an
extent contains the row headings while the first row of an extent bears the column
headings. Every cell within a particular column (except the first column) is
associated with the same column heading, which occupies the top cell of that
column. Similarly, every cell within a particular row (except the first row) is
associated with the same row heading, which occupies the left-most cell of that
row.

Column headings and row headings of an extent must be defined by the user. No 
two cells can have exactly the same combination of column heading and row 
heading as there cannot be two or more column headings or row headings with the 
same name, although a column heading can share the same name with a row 
heading. 

The following steps must be taken in defining an extent and its boundary: 

Step 1:

All adjacent data cells within a column, that logically have the same column 
heading, are given a column heading which occupies the cell immediately above 
the first data cell of the column. It is possible for there to be only one data cell in 
the column. 

Step 2: 

All related columns of data are then identified and placed next to the first data
column, to form a table. The top cell of each of these columns bears the column
heading. The columns are only considered to be related if they are of the same
height and when put next to each other, every cell within a particular row must
logically have the same row heading.

Step 3: 

The row heading is entered into the cell to the left of the first cell of each row 
within the table. The entire matrix now is called an extent and represents a module 
or logical area of the spreadsheet. The spreadsheet model shown in Figure 2 is an 
example of an extent. The resulting generic structure of an extent, after these three 
steps are performed, is as shown in Figure 7 below. 
Figure 7 Generic structure of an extent.

With reference to Figure 7, 

• The boundary of the extent is defined in terms of its top-left cell (column q,
row r) and bottom-right cell (column l, row m).

• Column headings are contained in cells in the first row (row r) of columns
q+1 to l.

• Row headings are contained in cells in the first column (column q) of rows
r+1 to m.

• Data values are contained in all the other cells except the top-left cell (column
q, row r).
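The rules for an extent listed above lend themselves to a small data structure. The Python sketch below is one possible representation, not part of the proposed method itself: the class name, field names and error messages are assumptions. It enforces the minimum size, the uniqueness of column headings and of row headings, and the requirement that every data cell has exactly one heading of each kind.

```python
from dataclasses import dataclass


@dataclass
class Extent:
    column_headings: list[str]     # first row, columns q+1 to l
    row_headings: list[str]        # first column, rows r+1 to m
    data: list[list[float]]        # data[i][j]: row heading i, column heading j

    def __post_init__(self):
        if not self.column_headings or not self.row_headings:
            raise ValueError("an extent must be at least a 2 by 2 range")
        if len(set(self.column_headings)) != len(self.column_headings):
            raise ValueError("column headings must be unique")
        if len(set(self.row_headings)) != len(self.row_headings):
            raise ValueError("row headings must be unique")
        if len(self.data) != len(self.row_headings) or any(
            len(row) != len(self.column_headings) for row in self.data
        ):
            raise ValueError("data cells do not match the headings")

    def value(self, column_heading: str, row_heading: str) -> float:
        """Look up a data value by its (column heading, row heading) pair."""
        i = self.row_headings.index(row_heading)
        j = self.column_headings.index(column_heading)
        return self.data[i][j]


wages = Extent(
    column_headings=["Day Wages £", "Night Wages £"],
    row_headings=["Grade 1", "Grade 2"],
    data=[[17700.50, 0.00], [45540.00, 1400.55]],
)
```

Because no two cells share the same pair of headings, a lookup such as `wages.value("Night Wages £", "Grade 2")` identifies exactly one cell.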

The concept of coupling was introduced in the previous section. Within the context
of this modular approach, there is coupling between two modules when a formula
in one references data in the other.

For a particular instance of coupling between two modules, the referencing module
is the dependant and the referenced module is the precedent. Coupling in this
paradigm is the precedent-dependant link between cells in different modules. This
is illustrated in Figure 8. For further clarification of these terms, refer to
section 3 (Microsoft Excel Audit Tool).
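Given a record of which module each cell belongs to, the couplings between modules can be listed by scanning every precedent-dependant link and keeping those that cross a module boundary. The Python sketch below illustrates this; the function name, the module names and the cell assignments (loosely based on Figure 3) are hypothetical.

```python
def couplings(module_of: dict[str, str],
              precedents: dict[str, set[str]]) -> set[tuple[str, str]]:
    """Return (precedent module, dependant module) pairs that are coupled."""
    links = set()
    for cell, referenced in precedents.items():
        for ref in referenced:
            if module_of[ref] != module_of[cell]:
                links.add((module_of[ref], module_of[cell]))
    return links


# Hypothetical split of a Figure 3 style sheet into two modules.
module_of = {"A8": "Sales", "B2": "Sales", "C2": "Sales", "D2": "Rates"}
precedents = {"A8": {"B2", "C2", "D2"}}   # the total references the rate in D2

found = couplings(module_of, precedents)
```

References within the same module are ignored; only the link from the exchange-rate cell to the sales total counts as coupling between the two modules.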




Figure 8 Precedent-dependant link/coupling between modules. 

4.3 Visual Representation of Elements of a Formula in Natural 
Language Form 

As a by-product of the modular approach, it is possible to visually represent the 
elements of a spreadsheet formula. This is aimed at minimising errors in 
spreadsheets. 

Most of the errors that occur in spreadsheets are those concerning formulae. These 
are also the sort of errors that have the greatest impact on the business and its 
operations. Numerous types of errors can be made when entering formulae into 
cells, one of the most significant of which is the entry of incorrect cell addresses or 
references. 

When such errors are committed, it is often difficult to detect and correct them
from the original form of the formulae as they appear in the formula bar of the
spreadsheet screen. This is primarily due to the use of cell addresses in the
formulae to refer to data.

This problem could therefore be overcome if formulae were represented in a more
visual, English-like and comprehensible form, which would facilitate the
validation and audit of spreadsheet formulae. The proposed technique for visually
representing spreadsheet formulae presents formulae in such a form. Any software
tool used to implement this technique will be able to convert a formula written
by a user in conventional form, expressed in terms of cell addresses, into a form
that is more readable and visual. This is done mainly by displaying the
corresponding column and row headings of each cell referenced by a formula. This
makes every spreadsheet cell value meaningful and helps the user understand this
meaning when creating and using the spreadsheet. Several methods of presenting
formulae in this form have been developed. These methods are illustrated below
using the spreadsheet model in Figure 2.



      C              D              E              F              G              H

                     Number         Day            Night          Total          Average
                     of Staff       Wages £        Wages £        Wages £        Wage £
 6    Grade 1        1              17700.50       0.00           17700.50       17700.50
 7    Grade 2        3              45540.00       1400.55        46940.55       15646.85
 8    Grade 3        9              122340.00      2000.00        124340.00      13815.56
 9    Grade 4        12             102350.25      0.00           102350.25      8529.19
10    Grand Total    25             287930.75      3400.55        291331.30      11653.25

Segment of Figure 2

Referring to Figure 2, formulae are present in the following cells:

G6 to G10, H6 to H10, and D10 to F10.



The formulae selected to illustrate the various methods are given below:

Cell    Formula

F10     =SUM(F6:F9)

H10     =G10/D10

Method 1: Algebraic English

Cell    Representation of the Formula

F10     =SUM(Night Wages £_Grade1:Night Wages £_Grade4)

H10     =Total Wages £_Grand Total/No of Staff_Grand Total

This method simply converts each cell address to its corresponding column and
row headings but retains the binary operators.

Method 2: Fully English

Cell    Representation of the Formula

F10     Night Wages £_Grand Total = SUM (Night Wages £_Grade1 (to)
        Night Wages £_Grade4)

H10     Average Wage £_Grand Total = Total Wages £_Grand Total
        (divided by) No of Staff_Grand Total

This method converts each cell address to its corresponding column and row
headings as well as each binary operator from symbol to natural language.
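Methods 1 and 2 amount to text substitutions on the formula, which the following Python sketch illustrates. It is not the authors' tool: the function names are hypothetical, only single-letter column addresses are handled, the heading tables cover just the cells used in the examples, and only the two operator phrases that appear in the text, "(to)" and "(divided by)", are included.

```python
import re

CELL = re.compile(r"\b([A-Z])([0-9]+)\b")

COLUMN_HEADINGS = {"D": "No of Staff", "F": "Night Wages £",
                   "G": "Total Wages £", "H": "Average Wage £"}
ROW_HEADINGS = {6: "Grade1", 9: "Grade4", 10: "Grand Total"}
OPERATOR_WORDS = {":": " (to) ", "/": " (divided by) "}


def _name(match: re.Match) -> str:
    """Replace one cell address by 'column heading_row heading'."""
    return f"{COLUMN_HEADINGS[match.group(1)]}_{ROW_HEADINGS[int(match.group(2))]}"


def algebraic_english(formula: str) -> str:
    """Method 1: headings replace addresses, operators are kept."""
    return CELL.sub(_name, formula)


def fully_english(cell: str, formula: str) -> str:
    """Method 2: headings replace addresses and operators become words."""
    body = formula.lstrip("=")
    for symbol, words in OPERATOR_WORDS.items():
        body = body.replace(symbol, words)
    return f"{CELL.sub(_name, cell)} = {CELL.sub(_name, body)}"
```

For example, `algebraic_english("=SUM(F6:F9)")` gives the Method 1 form shown for cell F10, and `fully_english("H10", "=G10/D10")` gives the Method 2 form shown for cell H10.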

Method 3: Graphic Display 



Cell: F10

Representation of the Formula:

[Graphic display not reproduced. The surviving fragments show boxes labelled
Grand Total carrying the values 11653.25, 291331.30 and 25.]


This is the most visual of the three methods. It is also the method preferred and 
recommended by the authors of this paper. Apart from associating each cell 
address with its column and row headings, this method also displays the value 
contained in the particular cell. In each display, a different colour is used for each 
different column heading and row heading. For instance, in the first example, all 
three cells shown have column heading (Night Wages £) in the same colour (blue) 
text to indicate they are all in the same column. However, the row headings 
(Grand Total, Grade 1 and Grade 4) are in different colours (red, green and 
purple) indicating that they are in different rows. 
This approach may be successful in eliminating many of the errors referred to in
section 2, particularly the following:

i) Qualitative errors, especially formatting errors. 

ii) Quantitative errors, especially logic errors. A large number of 
mechanical errors can also be picked up through the use of method 4 
(section 4.1). 

4.4 Survey

A survey was carried out to determine students' preferences among the visual
methods described in section 4.3. The students were presented with four choices:
the normal MS Excel formula style and the three methods mentioned above. They
were asked to rank them in order of clarity and ease of understanding. There were
63 respondents to the questionnaire. 46 respondents (73%) indicated a preference
for the visual methods. 21 of them (34%) chose method 3 (the graphic display) as
the clearest and easiest to understand. It was the most appealing of the four
choices, ahead of the normal formula style (26%), the algebraic English (18%) and
the fully English (22%).

With the use of colour in the questionnaire and a better understanding of the 
benefits of method 3, it is possible that a lot more students would have selected it 
as their first choice. The findings, however, do endorse research of the visual 
technique, especially the graphic display. 



5 CONCLUSION 

This paper has covered the magnitude of the problem of spreadsheet errors, a
classification of these errors and an appreciation of where errors occur in the
spreadsheet life cycle. A study of existing tools and methods for integrity control
revealed that the whole approach was piecemeal. From this arose the need to
investigate a new approach to the discipline of spreadsheet building based on
software engineering principles.

A modular approach encompassing the concepts of modularisation and coupling 
was discussed as a possible method of integrity control in spreadsheets. This in 
turn, gave rise to a visual technique to represent elements of spreadsheet formulae. 
All of these findings may now be integrated into a coherent spreadsheet 
development methodology. 
[Figure 9 (diagram not reproduced): Module 1 and Module 2 are each constructed
using the modular approach and then given an INTRA-module integrity check using
Chadwick et al's 5-step methodology, the visual formulae method (section 4.3),
the MS Excel audit tool and Spreadsheet Professional.]

Figure 9 Spreadsheet development methodology.

A Spreadsheet Building Methodology could therefore be as follows: 

Step 1: Create a logical model

Step 2: Create physical modules 

Step 3: Perform INTRA-module integrity checks 

Step 4: Perform INTER-module/coupling integrity checks 

The principal advantages of this approach are as follows: 

i. It may be useful in all phases of the spreadsheet life-cycle, as shown in
Figure 10.

ii. The industry trend is towards Graphical User Interface (GUI), visual languages,
etc.

iii. The concept of modularisation is based on tried and tested techniques in use in
software engineering and can be used to structure spreadsheets.

iv. The visual technique could greatly help in the process of spreadsheet auditing,
which is a key concern in many companies.

v. It requires little training for users due to its simplicity.

Further work needs to be done on: 

i. A methodology/tool for capturing requirements. 

ii. A methodology/tool to aid modular structuring of the logical model. 

iii. Creation of an add-in software tool that facilitates modular structuring of the
physical model, which will aid intra-module and inter-module integrity
checking.

iv. A methodology/tool for reducing mechanical errors.

v. A methodology/tool for auditing completed spreadsheets.
Abstract

The paper studies errors in information systems (IS) and the resulting loss of
integrity in IS. Specifically, the paper identifies accuracy, consistency and
reliability as intrinsic integrity attributes that all information systems must
satisfy. In networked computerized information systems, it is the inadequacy of
application controls conceived at the system design stage, and the possibilities
of inherited errors and human errors, that result in lapses in information
integrity (II), in turn exposing IS to a massive risk of information pollution.
The paper argues that these errors, which are made but not corrected, are due to
factors drawn from the system environment external to the application system and
overlapping the user environment. This calls for automatic feedback control
systems for on-line error detection (or estimation or prediction as the case may
be) for integrity improvement in information systems.
1 INTRODUCTION

Today's networked computerized information systems see ‘Data’ as raw material,
‘Data Product or Information’ as processed data used to trigger certain
management action, and ‘Processing’ as the system function. They are
characterized by (a) computing processes that include microcomputers and
telecommunication, and (b) pre- and post-processing stage communication channels
at various data/information processing nodes, which are people-based and include
data communication and transaction processing networks with world-wide reach.
Such a decentralized structure of IS has certainly enabled organizations and
individuals to work with shared data environments and to capture, use and control
growing, complex and diversified volumes of data and information, in turn
affording business access to bigger markets. However, these networked
computerized information systems make mistakes. It is this alarming reality that
requires attention to questions of errors in information systems, of poor integrity
of information systems, and of finding methods, technologies and techniques for
maintaining and improving integrity.

2 ERRORS IN INFORMATION SYSTEMS 



Consider a conceptual representation of an Information System (IS) model as
given in Figure 1, where < e_i, a_i, v_i > denotes a triple for the data model (input to the
information system) and < e_o, a_o, v_o > denotes a triple for the information model
(output from the information system). Here < e, a, v > represents a datum as the triple
< entity, attribute, value >, as developed by the database research community. This
representation, which permits treating data/information as a formally organized
collection, allows integrity issues to be segmented into issues concerning entities,
attributes and values, thereby making it feasible to study IS integrity analytically.



[Figure 1: block diagram. Recoverable labels: Data Model < e_i, a_i, v_i >; Data
Origin; Communication Channel; Communication Channel; Output; People and Medium.
Here < e_i, a_i, v_i > denotes a triple for the data model and < e_o, a_o, v_o > a
triple for the information model.]

Figure 1 Conceptual Presentation of an Information System Model.
Each stage in this networked computerized information system model is subject to
errors. To elaborate, the data origin stage comprises designing and operating data
collection procedures and codes, form filling (leading to data generation), data
collection, and data preparation along with machine operation (where data is
converted into machine-readable form). Errors in operating data collection
procedures and codes arise from incorrect, incomplete or ambiguous manuals,
manual language unsuited to the user, non-availability of the manual, and
carelessness; they result in the use of a wrong procedure or code.

During the filling in of data forms, errors arise from ambiguous form-filling
directions, poor format, substitute or unauthorized persons filling in forms,
poor motivation and, once again, carelessness, resulting in incorrectly filled
forms. The data collection stage suffers from errors caused by poorly
designed forms and codes, poor handwriting and carelessness, resulting in
omissions, inaccuracies, data in the wrong place, loss of data, data manipulation, etc.
Further, the data preparation and machine operation stages, where data is
prepared and converted into machine-readable form, suffer from errors caused by
poorly written keypunch instructions, poor procedures, hardware errors, poor
maintenance, misrouting of documents, sabotage and carelessness, resulting in
incorrect operations, fraudulent operations due to crime, data not processed on
time, and machine breakdown.

Turning to the pre- and post-processing communication channels, the growth of
online databases, the distribution of information system resources across
geographically separate locations, user communities spanning the entire enterprise,
the increased use of data communication functions, and information management
networks combining terminal access to information processing and database
resources, electronic mail, office automation, word processing, facsimile and
graphics have understandably made ‘communication channels’ a very significant
component of an information system. These communication channels turn out to be
highly prone to errors.

To begin with, unlike long-distance voice communications and conventional
radio broadcasts, data communication content and performance are affected by
inherent signal interference, usually referred to as communication noise, which
introduces errors into the data transmitted [Menkus, 1990]. Other causes of error
in the communication channel relate to the physical structure of
telecommunication and to logical aspects of the data communication process
itself. Specifically, causes relating to the physical structure of
telecommunication (in addition to the communication channel noise
already mentioned) include electromagnetic signal radiation, circuit
switching deficiencies such as cross talk between circuits and the failure of
mechanical or electronic components of the switch itself, and any interruption in
the electrical power supply used by the switching facility.

Causes relating to the logical aspects of data communication
include failure of the software used either in data communication network
management or in data communication itself, and the size and inherent complexity
of the network being used. In addition, the communication channel
suffers from theft of service and circuit tapping, acts of sabotage,
incidents of accidental destruction, and the effects of adverse weather and water
damage.

The processing stage, which transforms ‘data’ into ‘information’,
comprises machine operation, the use of data files, the use of systems and application
programmes, and the processing operation itself. Errors during machine operation
have been identified earlier. For data files, errors are due to poor
physical storage, lack of clearly defined responsibilities for data files, inadequate
procedures, natural disaster, and theft, fraud or sabotage, resulting in warped cards,
dirty tapes or disks, destruction of files, etc.

During the use of systems and application software, errors are due to
out-of-sequence programming, wrong algorithms, wrong programming instructions,
poor documentation and lax security, resulting in incorrect solutions and unauthorized
changes. In the processing itself, perhaps the most significant cause
of errors is carelessness in data processing, resulting in lost records and the use of
an incorrect file.

Finally, it is at the output stage that the user receives the ‘information’ which is
the output of the information system. The user employs this ‘output’ as an aid
in action, in management or in decision. It
is the product of all input and processing under the information system model.
Errors at the output stage are thus caused by processing and operation errors,
resulting in inaccurate and incomplete output.

Figure 2 gives a systems view of the errors in an information system identified 
above. 

3 LOSS OF INTEGRITY IN INFORMATION SYSTEM 

Chambers dictionary gives the meaning of integrity as ‘entireness, wholeness,
unimpaired state of anything, purity’. Errors as above at the various stages of an
information system certainly result in loss of integrity at each of these stages, as
well as in loss of overall system integrity. Specifically, errors at the data origin stage,
resulting in the use of wrong procedures or codes, incorrect filling of forms,
incorrect or fraudulent operations, data not processed on time, machine
breakdown, etc., give rise to inaccurate, incomplete, out-of-date and insecure data,
further threatened by loss of privacy in view of fraudulent operations.

Similarly, errors during the communication channel stage prior to processing,
caused by communication channel noise, the physical structure of telecommunication,
failures in logical aspects of data communication, circuit tapping and theft of
service, acts of sabotage, incidents of accidental destruction and the
unpredictability of the complex networks used, give rise to inaccuracy,
incompleteness, loss of confidentiality and loss of privacy in data. Further, during
the processing stage, errors in machine operation, in data files, in
application and systems software and in processing itself give rise to
inaccurate, incomplete and insecure data, further threatened by loss of privacy. As
pointed out above, the communication channel at the post-processing stage also contributes
to the loss of accuracy, completeness, confidentiality and privacy - this time of
processed data, i.e., information.

Finally, errors at output stage also result in inaccurate and incomplete 
information. 



4 DEFINING INTEGRITY ATTRIBUTES: A HEURISTIC TREATMENT

As can be seen, errors in information systems result in loss of integrity at each
stage of the information system, and thereby in loss of overall system
integrity. This integrity loss is in terms of the following attributes (not to be confused with
the entity attributes referred to in this section and paper): accuracy (purity),
completeness (entireness, wholeness), data/information being up-to-date (i.e.,
timeliness, implying accuracy in spite of time-related changes in data/information),
and security and privacy (unimpaired, meaning undamaged; purity).

Let us consider data/information modeled as a triple < e, a, v >, as suggested in
Section 2 above. As explained earlier, this affords a very meaningful approach
whereby the integrity attributes of an information system can be considered by
studying them for the components of the triple, i.e., entity,
attribute and value [Redman, 1992].

To elaborate, a universe for a company may comprise ‘employees’, ‘products’
and ‘customer orders’. Employees and products represent ‘entity classes or types’
and customer orders represent a ‘relationship’ in the universe. For the purpose at
hand, an entity class, say the employees class, may be represented by attributes as
follows:

EMPLOYEE = (Employee Number, Name, Department Number, Salary, Date
of Birth, Sex)

Finally, a value provides information for a specified attribute of a specified entity.
For example, for the employee entity Albert, the entity representation may be as follows:

Albert = (94256, Albert, 9, $ 15,000 p.a., 6.5.75, M)

In such a case, the Date of Birth attribute has the value 6.5.75, the Salary attribute has
the value $ 15,000 p.a., and so on.
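The < entity, attribute, value > model above can be sketched in a few lines of
Python. This is an illustrative sketch only: the names `Triple` and `to_triples`
and the dictionary layout are assumptions introduced here, not notation from the
paper.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class Triple:
    """One datum: the value of one attribute of one entity."""
    entity: str
    attribute: str
    value: Any


def to_triples(entity: str, record: Dict[str, Any]) -> List[Triple]:
    """Flatten an entity record into one triple per attribute."""
    return [Triple(entity, attribute, value) for attribute, value in record.items()]


# The Albert example from the text; storing the salary as a plain number
# (dollars per annum) is an assumption made for this sketch.
albert = {
    "Employee Number": 94256,
    "Name": "Albert",
    "Department Number": 9,
    "Salary": 15000,
    "Date of Birth": "6.5.75",
    "Sex": "M",
}

triples = to_triples("Albert", albert)
```

Each element of `triples` is then a unit on which an integrity attribute such as
accuracy can be examined separately.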

4.1 Accuracy, completeness and timeliness attributes 

With the data/information model (in terms of triples) illustrated above, a clearer picture
of integrity attributes can be obtained. For example, to study the accuracy of the
information on the company, one may study the accuracy of the information on the entity
types, namely employees and products, and on the relationship customer orders.
Further, to study the accuracy of information on the entity type employees, one may study
the accuracy of information about the attributes corresponding to that entity type.
Finally, to study the accuracy of information about attributes, one may study the accuracy
of the values of the attributes. This decomposition makes studying the accuracy of
information on the company a viable exercise. For example, to study the accuracy of
information about the salary attribute in the illustration above, one has to see whether
the value of the salary attribute, $ 15,000 p.a., is correct or not.

Similarly, one can ascertain the requirements of completeness and timeliness (i.e.,
accuracy in spite of time-related changes) of information, both of which are
‘necessary’ for the requirement of ‘accuracy’, though not sufficient.

4.2 Security (confidentiality) and privacy attributes 

Finally, from the point of view of the information model, the requirement of security,
meaning undamaged information, is analogous to accuracy, as any damage to
information, that is to say, to the value of an attribute, will only result in
inaccuracy of the value.

Security also has an aspect of confidentiality. Further, security of information is
also important from the point of view of privacy. However, the requirements of
confidentiality and privacy, though they emerge as implications of errors in the
information system, cannot be considered central requirements for all
information systems, as there can be information for which confidentiality (and,
for that matter, security) and privacy may not be required.
Thus, from the set of integrity attributes of accuracy, completeness, timeliness,
security and privacy identified above, the attribute of accuracy is central to an
information system, and the attributes of completeness and timeliness are necessary
for accuracy. In other words, the attributes of accuracy, completeness
and timeliness are intrinsic to an information system irrespective of the use of
the information derived from the system. Against this, the requirements of security in the
sense of confidentiality and of privacy are optional for an information system and
depend on the context and nature of use of the information.

4.3 Consistency and reliability attributes 

Two other requirements have not emerged in the integrity analysis
so far: consistency and reliability of data/information. Information can
be seen to meet the consistency requirement if, say, an attribute value satisfies its
domain as well as the constraints on the value. Like the completeness and
timeliness requirements, the consistency requirement is also a part of the accuracy
requirement; i.e., if data/information is accurate, then it is also consistent, but
the converse is not true.
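The distinction between consistency and accuracy can be illustrated with a small
Python sketch. The domains, the salary-cap constraint and the name
`consistency_violations` are hypothetical and chosen only for illustration; the
paper does not prescribe any particular rules.

```python
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]

# Hypothetical domains: each attribute maps to a predicate on its value.
DOMAINS: Dict[str, Callable[[Any], bool]] = {
    "Department Number": lambda v: isinstance(v, int) and 1 <= v <= 99,
    "Salary": lambda v: isinstance(v, (int, float)) and v >= 0,
    "Sex": lambda v: v in {"M", "F"},
}

# Hypothetical cross-attribute constraint, named for reporting.
CONSTRAINTS = [
    (
        "salary cap for department 9",
        lambda r: r["Department Number"] != 9 or r["Salary"] <= 50000,
    ),
]


def consistency_violations(record: Record) -> List[str]:
    """List domain and constraint violations; an empty list means consistent."""
    problems: List[str] = []
    for attribute, in_domain in DOMAINS.items():
        if attribute not in record:
            problems.append(f"{attribute}: missing")
        elif not in_domain(record[attribute]):
            problems.append(f"{attribute}: outside domain")
    if not problems:  # constraints are only evaluated on domain-valid records
        for name, holds in CONSTRAINTS:
            if not holds(record):
                problems.append(f"constraint violated: {name}")
    return problems


# A salary of 15500 passes every check even if the true salary is 15000:
# the record is consistent without being accurate.
wrong_but_consistent = {"Department Number": 9, "Salary": 15500, "Sex": "M"}
```

No real-world reference value is consulted anywhere in the check, which is why it
can only approximate accuracy.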

The requirement of reliability has its origin in the very
choice of the information system model, wherein the system output, i.e., information,
is defined as what the user receives as an aid in action, in management or in
decision. Here no user is defined as such, but the ‘utility’ or ‘use’ role of information
is brought out. It is in this context that the requirement emerges that the information
obtained be reliable. Reliability is perceived here as the
accuracy with which the information obtained represents the data item in whatever
respect the information system processed it.

Depending on the nature and type of the information item, there can be methods and
techniques for quantifying integrity attributes. For example, consider an
information item that is the value of an attribute where, as in the
example considered earlier, the attribute is ‘Salary’, which takes a
numerical value. The accuracy of the information item on the value of the Salary
attribute in a given month can then be quantified in terms of, say, the ‘difference between
value obtained and correct value’ or the ‘error ratio of actual error and
acceptable error’, etc. But what if the correct value of the
identified source of the information item (also called the standard) is undefined or
simply unknown? In another situation, an assumed standard may itself be incorrect, as is
often the case with data gathered some time in the past and with no corroborating
evidence. In yet another situation, there may be more than one correct value. Then
there is the problem of how to quantify accuracy if the value does not lie on the real
line, i.e., it is not numerical but, say, alphanumeric or alphabetic, or even a
picture, sound or multi-media presentation.
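For a numerical attribute with a known correct value, the two measures mentioned
above can be sketched as follows. Reading the ‘error ratio’ as actual error
divided by acceptable error is an assumption, as are the function names and the
sample figures.

```python
def absolute_error(obtained: float, correct: float) -> float:
    """Difference between the value obtained and the correct value."""
    return abs(obtained - correct)


def error_ratio(obtained: float, correct: float, acceptable_error: float) -> float:
    """Actual error divided by acceptable error; above 1.0 is out of tolerance."""
    if acceptable_error <= 0:
        raise ValueError("acceptable_error must be positive")
    return absolute_error(obtained, correct) / acceptable_error


# Hypothetical figures: a recorded salary of 15,150 against a correct value
# of 15,000, with an acceptable error of 100.
print(absolute_error(15150, 15000))    # 150
print(error_ratio(15150, 15000, 100))  # 1.5
```

Both functions presuppose that the correct value is known and numerical, so they
do not address the difficulties raised in the paragraph above.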

What is said here about the accuracy attribute is also valid for the completeness,
timeliness, consistency and reliability attributes. However, the immediate aim here
is not to offer methods for quantifying integrity attributes, but to state the significance
and centrality of the integrity attributes of accuracy, completeness,
timeliness, consistency and reliability for an information system.

4.4 Intrinsic integrity attributes 

Thus, taking the entire discussion together, irrespective of the nature of use of the
information obtained from the information system, the attributes of accuracy,
completeness, timeliness, consistency and reliability emerge as intrinsic attributes
that an information system must meet, while the attributes of security and privacy are
optional, depending on the context and nature of use [Mandke, 1996; Mandke and
Nayar, 1997].

There is more to the intrinsic integrity attributes mentioned above. As pointed
out earlier, the attributes of completeness and timeliness are necessary for accuracy.
That is to say, when checked for accuracy, the value of the information item also
gets checked for its completeness and for its being up-to-date (timeliness), as
an accurate value has to be complete and timely. In that sense, it is sufficient to check
for accuracy only.

The situation is similar for consistency, as an accurate value also
has to be consistent. The difference, however, is that, as mentioned earlier, a consistency
check is made in terms of domain values and constraints, without reference to
real-world objects. It is therefore a simpler and less expensive task that offers a first
approximation to accuracy and, when performed in addition to an accuracy check, increases
the overall reliability of the integrity checking process itself.

It is within the above framework that accuracy (including completeness and
timeliness), consistency (satisfying constraints) and reliability (the accuracy with
which the information item represents the data item in whatever way the information system
processed it) emerge as the intrinsic, basic or objective attributes of information
integrity. As mentioned earlier, what one is considering here is an information
system model which delivers information for use. Therefore, depending on the IS
application area and industry standards, and without reference to any specific user,
there may be different requirements/standards of information. For example,
although accuracy of information is intrinsic and central to information system
performance, in many application areas, irrespective of who the individual user is,
it is not uncommon for four digits to the right of the decimal place to be rounded
off without much loss of information. Thus, even though accuracy, consistency and
reliability emerge as the basic integrity attributes that all information systems
must demonstrate, each of these intrinsic integrity attributes may, depending on the
application area, have to satisfy different application-area-specific industry
standards [Mandke and Nayar, 1997].

4.5 Extrinsic integrity attributes 

As observed earlier, depending on the context and nature of use, there would also
be user-specific integrity requirements, namely security and privacy, and these
emerge as extrinsic or subjective attributes of Information Integrity. There are
other subjective attributes of Information Integrity, too, as reported in the
literature based on user surveys of data/information requirements
[DeLone and McLean, 1992; Mandke, 1996]. These are: usability, independence,
precision, relevance, sufficiency, understandability, freedom from bias,
conciseness, brevity, trustworthiness, etc.

Just as in the case of the intrinsic or objective attributes of Information Integrity,
there is also the question of defining and, where possible, quantifying these extrinsic
or subjective integrity attributes. This question, however, is beyond the scope of the
present inquiry, which concerns developing a design basis for achieving Information
Integrity.

Figure 3 presents a systems view of the integrity implications for a networked
computerized information system.

5 INFORMATION POLLUTION - A DIRECT IMPLICATION OF 
LOSS OF INFORMATION INTEGRITY 

Traditionally, errors in information systems (IS) and the resulting integrity
implications have been addressed as if they were preventable or correctable by
building controls within the application system. Over time there have been efforts to
invest more right at the system analysis and design stage, in the hope that this
would ensure Information Integrity [Hussain, D. and Hussain, K.M., 1984].

5.1 Inadequacy of controls within computing system

However, the reality is different. The main reason for this is, ironically, the inadequacy of
the controls designed to meet lapses in integrity at different IS stages. To elaborate,
with reference to data collection procedures and codes, in view of shared data
environments, user communities spanning the entire enterprise and
the human element of carelessness, it is extremely difficult to avoid situations such as
the manual being unavailable when needed, use of an unauthorized manual, or the user's
inability to understand the manual language. These result in the use of a wrong
procedure or code and, therefore, in inaccurate and insecure data in spite of controls.
At the form filling stage, situations of poor motivation, carelessness and substitute or
unauthorized persons filling out forms continue, resulting in incorrectly filled
forms. At the data collection stage, errors caused by carelessness are
very difficult to eliminate completely. Similarly, controls at the data preparation stage
are not sufficient to eliminate the error implications of carelessness in data
preparation. Further, it is also extremely difficult to make controls such as employee
selection and training foolproof.

At the machine operation stage, errors caused particularly by human
behaviour, such as carelessness of operators, sabotage, desire for personal gain, and
documents lost or misrouted, are difficult to eliminate. Once again, it is
difficult to have controls such as upgraded personnel selection or training that are
foolproof. Similarly, in the communication channels at the pre- and post-processing stages,
in spite of all communication controls, errors caused by failure of data
communication software, channel noise, and the unpredictability of the data
communication network due to its size and complexity, not to speak of the adverse
influence of people and weather, are difficult to eliminate. The controls during
machine operation at the preprocessing stage, and their inadequacy, are the same as those for
machine operation following the data preparation stage discussed earlier.

For data files, the controls considered are controlled-humidity storage, ‘clean
room’ conditions, special cabinets, periodic cleaning, centralized storage under a
librarian, upgrading of storage procedures, backup data and controlled access to
files. However, these controls cannot completely remove the causes of theft, fraud or
sabotage, which stem from human behaviour. The controls on systems
and application software do not ensure removal of all errors, particularly those
caused during programming and those due to poor documentation and lax
security. Nor can the controls at the processing stage remove errors that are caused
by carelessness. Finally, controls at the output stage cannot always eliminate all
operation or processing errors [Mandke, 1997].

It is these errors at different IS stages, made but (owing to the inadequacy of
application controls) not corrected, that then leave an IS infested with
information of poor integrity.
5.2 Inherited errors

Rapidly evolving technologies, producing the distributed information networks and
shared data environments that are commonplace today, pose yet another issue.
Once an error occurs, it can be considered an ‘inherited error’ [Banks and Weimer,
1992] if it is passed along to another computer, network, database, file, or the like.
Inherited errors occur when an error is propagated beyond the system in which it
originated. For example, if a personal computer on a local area network is used to
prepare a report - and erroneous data is incorporated into that report - then when the
report is submitted to another computer system, an error is inherited [Wood and
Banks, 1993]. Once again, the various controls that are considered and implemented
cannot remove these inherited error possibilities.

5.3 Human error: a significant integrity problem

As one critically considers the above discussion, carelessness, poor motivation and
other actions of people emerge, among other factors, as the most
significant contributors to errors in information systems and, thereby, to lapses
in integrity. According to a study performed by the Executive Information
Network [Bloombecker, 1989], 55% of the respondents to a survey
considered human error the most important integrity threat. Human
error can also be one of the most serious problems causing system interruptions.

While the frequency of human error and the opportunity for human error are
important considerations, the magnitude of the loss due to human error is also a
major concern. Contrary to the general opinion of many information system
practitioners, human error is not always a low-consequence threat. In fact, in
terms of money lost, human error is the largest single cause of economic and
productivity loss in the information system integrity arena. As evidence of this,
consider a study reported in Computerworld [Annon, May 1987], which attributes
52% of corporate information damage to human error. Single-incident losses can
also be significant. Application controls cannot remove these human error
possibilities, as can be seen from the field feedback on error occurrences and their
consequences.

As computer systems become increasingly networked, as applications 
increasingly share data with one another, and as databases become increasingly 
distributed, the risk of data/information error - including risk of inherited error - 
increases, which amounts to the issue of data/information pollution. 
6 FACTORS EXTERNAL TO COMPUTING (AND APPLICATION)
SYSTEM RESPONSIBLE FOR LOSS OF INTEGRITY 
IN INFORMATION SYSTEM 

As discussed in the previous section, errors in information systems are caused
by factors not amenable to controls, including application controls conceived at
the system design stage itself. Of course, the literature reports research efforts on
identifying foolproof information requirements [Kliem, 1992; Mostert, 1994;
Tompkins and Rice, 1996], but design experience shows that this is not
easy to achieve.

This is because, owing to the factors detailed here, computerized information systems
invariably have errors that are made but not corrected by the controls incorporated
at the system design stage. As can be seen, these factors make their
presence felt mainly through the system environment, which is external to the computing
(and hence the application) system and overlaps the user environment, though
together they (the computing system and its external environment) constitute the
Information System Model of Figure 1. In spite of application controls, it is
these external factors that then cause information systems to give rise to information
which is inaccurate, inconsistent and unreliable [Mandke and Nayar, 1997;
Nayar, 1996].

These external factors can be grouped into five major categories, namely
change, complexity, communication, conversion and corruption.

Change 

Change may occur either in the content or in the configuration of the system
environment, resulting in the possibility of error introduction in the information
system. Every hardware change, software release and organizational change
comes under this category, offering cause for error and, therefore, the possibility
of inaccurate, inconsistent or unreliable information.

Complexity 

Whenever one introduces complexity, there is a possibility of error introduction in 
the information system. Every new component, be it a programme, database or a 
network, adds new interfaces increasing the possibility of error introduction. 
Communication

Communication stands for movement of data/information within or across 
enterprises and it also provides a chance for error introduction. 

Conversion 

Conversion, in this context, refers to the consolidation, decomposition or 
transformation of data. Whenever one converts data from one form to another, 
there exists a possibility of error introduction, resulting in information which may 
not be accurate. 

Corruption 

Finally, corruption pertains to human behaviour (poor motivation, desire for 
personal gain, carelessness, actions of people), to factors leading to inherited 
errors polluting the information systems, and to unpredictability (noise) of any 
kind leading to introduction of errors in computerized information systems. 

Whether or not, in addition to the controls discussed above, computerized information
systems also incorporate human engineering design criteria at the system design
stage itself, or hardware and software vendors further incorporate error-checking
filters into their products, it is these external error factors that have to be
addressed if one is to resolve the question of errors in the information system
model so as to obtain information which is accurate, consistent and reliable. As
observed by Svanks in her significant work entitled ‘Integrity Analysis: A
Methodology for EDP Audit Data Quality Assurance’, published in 1984 [Svanks,
1984], an information system can be viewed as a production line in a
manufacturing environment. The processing stage represents logic steps which utilize
input transactions as raw material or parts to yield processed data base records, i.e.,
information, as the end product. A typical production line incorporates process
control but, more importantly, also employs product control. The identification of
faulty processes alerts product quality control to invoke special procedures such as
tightened inspection, repair or discarding of finished goods. Conversely, the
disclosure of substandard products suggests remedial action for specific processes.

Information system testing then becomes the equivalent of process quality
control, in that the errors revealed call for software revisions and maintenance.
However, if the information system is already operative, a test result indicating an
error only suggests that such an error might have occurred in the past. As test
procedures do not access the production data base, no statement can be made as to
whether or not the information system error has occurred in the real environment, nor
as to which records have been affected by the erroneous process. Therefore, the
confirmation of potential or suspected anomalies on a live data base, and
subsequent integrity improvement, becomes an essential facility (beyond
application controls) within an information system.
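Confirming a suspected anomaly on stored records, rather than on test data, can be
sketched with Python's standard `sqlite3` module. This is only an illustration under
stated assumptions: the `employee` table, its columns, the sample rows (including the
names Beth and Carl) and the anomaly rule (missing or negative salary) are all
hypothetical, and an in-memory database stands in for the live data base.

```python
import sqlite3
from typing import List, Tuple


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory stand-in for the live data base (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE employee ("
        "emp_no INTEGER PRIMARY KEY, name TEXT, dept_no INTEGER, salary REAL)"
    )
    conn.executemany(
        "INSERT INTO employee VALUES (?, ?, ?, ?)",
        [
            (94256, "Albert", 9, 15000.0),
            (94257, "Beth", 9, -200.0),  # suspected anomaly: negative salary
            (94258, "Carl", 4, None),    # suspected anomaly: missing salary
        ],
    )
    return conn


def affected_records(conn: sqlite3.Connection) -> List[Tuple]:
    """Return the stored records that actually exhibit the suspected anomaly."""
    cursor = conn.execute(
        "SELECT emp_no, name, salary FROM employee "
        "WHERE salary IS NULL OR salary < 0 ORDER BY emp_no"
    )
    return cursor.fetchall()


conn = build_demo_db()
anomalies = affected_records(conn)
```

Unlike a test run, the query reports which stored records are affected, which is the
information needed before any integrity improvement can be attempted.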



7 NEED FOR AUTOMATIC FEEDBACK CONTROL SYSTEM 
FOR ON-LINE ERROR DETECTION AND INTEGRITY 
IMPROVEMENT IN INFORMATION SYSTEMS 

In information systems, the need is then to remove errors that are made but not
corrected. In abstract terms, consider the Information System Model of
Figure 1, with data/information represented by a triple < e, a, v >. Take
a particular example in which an output, i.e., processed data or
information, is represented by the entity class employees, the specific
entity (e) under consideration is an employee named Albert, and the specific
attribute (a) under consideration is Albert’s Salary. Then, by virtue of the on-line
errors present in the information system, there exists at any time a possibility of the
information item on the value (v) of Albert’s salary being inaccurate, inconsistent or
unreliable, i.e., of its being affected by error or, say, corrupted by noise.
A more realistic representation of the value (v) is therefore (v + η), where η
represents the noise or error component [Mandke and Nayar, 1997].

It is within this framework of error implications for the data/information model
that the triple $\langle e, a, v \rangle$ is replaced by the triple
$\langle e, a, v + \eta \rangle$. As discussed in Section 2, these error
implications are present at each stage of an information system, namely the data
origin stage, the communication channel prior to the processing stage, the
processing stage, the communication channel at the post-processing stage and the
output stage. Accounting for them yields a modified version of the conceptual
schematic of the Information System Model in Figure 1, one that represents errors
that are made but not corrected. It is given in Figure 4.

It is these on-line errors that are then to be removed. For this purpose, one has
to first detect errors and then correct them. In other words, what is required is to
incorporate on-line learning and error correcting mechanisms in the Information
System Model. Specifically, this calls for automatic feedback control systems with
error detection and correcting technologies for improved information accuracy,
consistency and reliability; technologies that maximize integrity of information
systems - Information Integrity Technologies.



8 INFORMATION INTEGRITY TECHNOLOGY DESIGN

Rajaraman [Rajaraman, 1996] points out that the integrity of the overall information
system in Figure 1 is ensured if the integrity of every part of the system is
ensured (see Figure 5).

Figure 5 Conceptual Presentation of Integrity of an Information System.

As a result, each stage and its intermediate stages in the information system as
discussed in Section 2, along with the overall information system (with data as
input and with processed data, i.e., information for use as output) considered as a
black box, become the candidates for incorporating automatic feedback control
systems or application and user specific software products constituting
Information Integrity Technologies as above.

What Information Integrity Technologies will have to do is to follow data as it 
originates, moves over communication channel at pre-processing stage and gets 
processed, and follow processed data, i.e., information as it moves over 
communication channel at post-processing stage and gets used by user (as 
information system output), so as to detect where the error(s) occurs. Based on 
this, the Information Integrity Technologies would then need to take corrective 
action to remove error(s), so as to improve integrity of information system 
stage(s) under consideration and of the overall information system. In doing so, it 
would also be helpful if it is possible to measure integrity of a stage in the 
information system and of the overall information system, thereby offering a 
measure of integrity improvement achieved and a statement of improved level of 
integrity. Most importantly, such detection and correction of error(s) and
improvement of integrity, along with measurement of the improvement and a statement
of the integrity level achieved, will facilitate demonstration of the integrity of
computerized information systems rather than merely trusting them
[Mandke and Nayar, 1997].

However, apart from difficulties that are obvious in designing and developing 
such automatic feedback control systems leading to Information Integrity 
Technologies and in measuring integrity (i.e., accuracy, consistency and 
reliability) of information, the most important difficulty in perceiving such 
technologies is that it is impractical to follow, i.e., track and analyze every bit of 
data/information for all times as it flows through the information system stages. 

A way out is to consider an Information Integrity Technology that takes a
sample of input data at the output or at an intermediate point of an appropriately
identified stage or sub-system of the information system and then follows, i.e. keeps
track of, the sampled records at the output or intermediate points of subsequent
stages (sub-systems), at a given point of time or at different points of time over a
required time interval. The records so obtained at a given stage (sub-system) can
then facilitate the study of patterns of errors at that stage (sub-system) in the
information system.

Utilizing the records available for different points of time over the time interval,
error patterns can also be studied for their variation over time. Based on the
patterns of errors so analyzed for a given stage (sub-system), the causes of errors in
the sampled data/information can be identified, so as to obtain corrective action
(probabilistic mechanisms included) to eliminate the causes, remove the errors and
improve the integrity. In other words, an Information Integrity Technology could
comprise a sampled data control system for each stage (sub-system) of the
information system as well as for the overall information system.
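
The sampling idea can be illustrated with a minimal sketch. It is not the authors' design: the stage names, the record values and the existence of a trusted reference value for each sampled record are all assumptions made for illustration, as is the simplification that the tracked attribute should pass through every stage unchanged.

```python
import random


def sample_ids(record_ids, k, seed=0):
    """Draw a reproducible random sample of record identifiers to track."""
    rng = random.Random(seed)
    return sorted(rng.sample(sorted(record_ids), k))


def stage_error_rates(tracked_ids, reference, snapshots):
    """Fraction of tracked records whose value differs from the reference, per stage."""
    rates = {}
    for stage, values in snapshots.items():
        errors = sum(1 for rid in tracked_ids if values.get(rid) != reference[rid])
        rates[stage] = errors / len(tracked_ids)
    return rates


# Hypothetical reference values (assumed verified by hand) and stage snapshots.
reference = {1: 100, 2: 200, 3: 300, 4: 400}
snapshots = {
    "origin": {1: 100, 2: 200, 3: 300, 4: 400},
    "pre_processing_channel": {1: 100, 2: 250, 3: 300, 4: 400},
    "processing": {1: 100, 2: 250, 3: 300, 4: 0},
}

tracked = sample_ids(reference.keys(), k=4)
rates = stage_error_rates(tracked, reference, snapshots)
```

Comparing the rates of consecutive stages indicates where an error first appears; repeating the measurement at different points of time would give the time variation of the error pattern.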

Figure 6 presents a block diagrammatic view of the Information Integrity 
Technology Design thus emerging. 

The literature refers extensively to the requirement of security in information
systems. However, security, meaning concern with controlling the dissemination of
information (confidentiality), is different from integrity of information, which
concerns the accuracy, consistency and reliability of information in a computer
system and requires control over modifications made to information. As a result,
security and integrity require separate mechanisms. Further, the concept of
‘Trusted computing base and procedures’, implying that an information system can be
trusted over time without the ability to provide evidence that the trust is well
placed, is incompatible with internal control principles and therefore insufficient
for the question of an error-free computerized information system. Also, with
networks becoming integral to information systems, there is a need to recognize the
presence of ‘noise’ in data/information models, and hence to conceive probabilistic
descriptions of information flow models and of on-line Integrity Improvement
Mechanisms [Mandke and Nayar, 1997].



9 CONCLUSION 

Errors in computerized information systems were relatively manageable as long as 
there was homogeneous system environment and centralized control over 
information. Emerging trends of globalization, changing organizational patterns, 
strategic partnering, electronic commerce and distributed computing have changed 
all this, posing risks to accuracy, consistency and reliability of information. 

These intrinsic information integrity attributes are central to any information 
system in that in their absence the information systems will have massive amounts 
of polluted (error-filled) data and useless, even dangerous information. 

These errors are essentially caused by the on-line factors of change, complexity,
communication, conversion and corruption, which have their presence mainly
through the system environment, which is external to the computing (and hence the
application) system and overlaps with the user environment. In spite of application
controls, it is these external factors that then introduce, in information systems,
errors that are made but not corrected.

The need therefore is to design and develop automatic feedback control systems
leading to application and user specific software products representing
Information Integrity Technologies that (a) carry out on-line detection (filtering
problem), estimation and prediction of errors contributing to loss of accuracy and
consistency and of causes contributing to loss of reliability, and (b) implement
an integrity improvement action plan (probabilistic mechanism included)
accordingly. Such Information Integrity Technologies could then also
provide a measure (metric) of the Information Integrity achieved, in turn
facilitating demonstration of the integrity of computerized information systems
rather than merely trusting them.



Abstract

In a conventional database management system, integrity constraints are defined
a priori and are static in nature. However, real-world data is often unpredictable
and evolves over time. In order to reflect these changes, there may be a need to
modify the constraints. Moreover, it is quite natural that some data may violate
the originally defined integrity constraints, and yet there is a need to store such
exceptional data in the database. This is because the schema may be ill-designed,
or the world may have changed since the design. Therefore, in order to capture
real-world situations, constraint modification is required in many systems. In such
systems the constraints evolve based on the knowledge derived from the data and from
the exceptions. In this paper, we show how such constraint refinement can be carried
out through knowledge discovery mechanisms. We use an attribute-oriented
generalization technique to derive knowledge.

1 INTRODUCTION 

Constraints are defined in a database system so as to protect the integrity
and consistency of the information. These are fundamentally invariant properties
of a database state. In other words, to be in a consistent state, the
contents of the database must adhere to all of the constraints. There are two
fundamental types of constraints: those that are properties defined by the
particular data model and those that reflect the data semantics associated
with a specific application. For example, key constraints, domain constraints and
referential integrity constraints, enforced by a relational database system, belong
to the first type of constraints. Semantic constraints range from simple
type constraints, to more complex expressions, to arbitrary rules or policies of
an organization. To assure that a database is consistent, the system enforces
the constraints when any modifications to the data are made. Therefore, in
a traditional database system, the information entering the database has to
conform to the constraints set by the database designers.

However, as the real world is quite unpredictable, irregular and ever changing,
the database system also has to be refined to accommodate these changes.
Moreover, while designing the database schema, the designers may not have a
complete understanding of the data and hence, as they acquire new knowledge
from data, they might want to modify the schema. Especially in a design
environment, it is not possible to perfectly define all the constraints at the
initial stages of the design. Thus, any database system must have some degree
of flexibility in order to fully support the activity of its users. Constraints,
which are part of the schema, also have to evolve depending on the knowledge
derived from the data.




Figure 1 The System Architecture 

More explicitly, modification of constraints in a system is required for
two reasons. (1) As the data is ever changing, new knowledge may be derived
from the new data, and it may be necessary to change some of the constraints
initially set by the database designers to suit the new environment. Hence,
by learning from the changes in data, constraints have to be modified if
necessary. (2) If the schema is ill-designed or no longer suits the
changed data, the only way to detect this is to accept and store
the exceptional data that violate the constraints. Hence, in order to capture
the irregular behavior of the real world, the database system has to accept
exceptions. If the volume of exceptional data becomes very large, one has to
modify the constraints.

This paper analyzes the situations in which constraint modifications are
required and describes methods to carry out these modifications. As shown in
Figure 1, the system accepts exceptions and stores them in logically separate
files. The data and the exceptions are fed to knowledge discovery mechanisms
to discover knowledge (in the form of rules). The discovered knowledge (K) is
fed back to the system and is utilized to refine the schema.
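
A minimal sketch of this accept-and-store behavior is given below. It is only an illustration under assumptions: the constraint name `salary_cap`, the 20% review threshold and the in-memory lists standing in for the database and the exception files are hypothetical and do not come from the paper.

```python
def insert(row, constraints, database, exceptions):
    """Store a conforming row in the database and a violating row in the exception store."""
    violated = [name for name, holds in constraints.items() if not holds(row)]
    if violated:
        exceptions.append((row, violated))
    else:
        database.append(row)
    return violated


def constraints_to_review(constraints, database, exceptions, threshold=0.2):
    """Names of constraints violated by at least `threshold` of all stored rows."""
    total = len(database) + len(exceptions)
    flagged = []
    for name in constraints:
        count = sum(1 for _, violated in exceptions if name in violated)
        if total and count / total >= threshold:
            flagged.append(name)
    return flagged


# Hypothetical constraint and data.
constraints = {"salary_cap": lambda row: row["salary"] <= 100_000}
database, exceptions = [], []
for salary in (50_000, 120_000, 130_000, 60_000):
    insert({"salary": salary}, constraints, database, exceptions)

suggestions = constraints_to_review(constraints, database, exceptions)
```

The function only suggests which constraints deserve attention; in keeping with the paper, the decision to change a constraint is left to a human administrator.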

Learning from the changes in data and from the exceptions, the system
suggests the modifications to the database schema that are required to suit
the new data. Schema evolution thus involves refining constraints, adding
or deleting attributes, discovering new class hierarchies, etc. This paper
deals with the evolution of the constraint-base and discusses the mechanisms
to modify it. The modification process only suggests a set
of modifications to the constraints and outputs them on request or when
required. The process is not fully automated, and human intervention is required
to materialize the actual modification. Thus the system acts as an aid
to the database administrator in refining the constraints.

This paper is organized as follows. The remainder of this section investigates 
the prior work and briefly describes the model assumed in this paper. In 
section 2, some motivating examples that emphasize the necessity to refine the 
constraints are enumerated. In section 3, the various types of constraints and 
the possible modifications in each of the type are described. The mechanism to 
derive knowledge from data is presented in Section 4. Section 5 and section 6 
discuss the techniques to modify the constraints by learning from the changes 
in data and from exceptions, respectively. The last section points out the 
limitations of this paper and discusses the required future improvements. 



1.1 Prior Work 

In (Shepherd & Kerschberg 1986), the authors emphasize that constraint
management is an essential part of a knowledge base system for managing
both data and knowledge. A concise declarative language for expressing
the constraints is provided in (Morgenstern, Borgida, Lassez, Maier &
Wiederhold 1987). In (Vianu 1983), the author discusses the evolution process
of functional dependencies that are considered as part of the constraint-base.
The database schema evolves as new functional dependencies (dynamic constraints)
are derived from the changes in the data over time. (Borgida 1985)
points out the necessity to allow exceptional data into the database even
though they do not conform to the integrity constraints set by the database
designers. It discusses numerous examples that occur in real life and are
considered as exceptions. It emphasizes that they have to be accepted by the
database system and must be stored in order to capture real-life situations.
It also discusses the problems encountered in handling these exceptions and
the methods to resolve them. In (Borgida, Mitchell & Williamson 1986), the
authors describe two methods to derive knowledge from the exceptions: empirical
generalization and explanation-based generalization. When prompted
by the database administrator or when sufficient evidence is accumulated, the
system suggests alternative changes to the schema.

(Cai, Cercone & Han 1991) gives methods to derive knowledge from large
relational databases. It presents algorithms to derive characteristic rules
and classification rules from the database using an attribute-oriented induction
approach. These methods are based on the techniques in “AQ15: Incremental
Learning of Attribute-Based Descriptions from Examples” (Hong,
Mozetic & Michalski 1992, Michalski, Brakto & Kubat 1997). Discovering
classification rules has been implemented in INLEN (Kaufman, Michalski &
Kerschberg 1991).

Our approach to modification of constraints using exceptions is similar to
that in (Borgida 1985). The main contribution of this paper is the
methodology to modify constraints due to the changes in normal data; we
present modification due to exceptions for the sake of completeness.



1.2 Model 

The data model assumed in this paper is a relational data model. It is important
to note that the techniques developed in this paper are applicable
to other data models such as the object-oriented model. Constraints are expressed
as first-order logical expressions.

We use first-order predicate calculus as the primitive language for knowledge
discovery from databases. From the logical point of view, each tuple in a
relation is a formula in conjunctive normal form. For example, the following
tuple

Tname   Dept   Cadre   TDegree   Country
Paul    CS     GTA     B.S       U.S.A.

represents the logic formula

$\exists x\, ((Tname(x) = Paul) \wedge (Dept(x) = CS) \wedge (Cadre(x) = GTA) \wedge (TDegree(x) = B.S) \wedge (Country(x) = U.S.A.))$
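
The correspondence between a tuple and its logic formula can be sketched in a few lines. The function name `tuple_to_formula` and the dictionary representation of a tuple are assumptions made for illustration only.

```python
def tuple_to_formula(row):
    """Render one relational tuple as an existentially quantified conjunction."""
    conjuncts = [f"({attr}(x) = {value})" for attr, value in row.items()]
    return "∃x (" + " ∧ ".join(conjuncts) + ")"


# The example tuple from the text, as an attribute-to-value mapping.
paul = {"Tname": "Paul", "Dept": "CS", "Cadre": "GTA",
        "TDegree": "B.S", "Country": "U.S.A."}
formula = tuple_to_formula(paul)
```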

The methodology developed in this paper utilizes a conceptual hierarchy of
attribute values. It assumes that the conceptual hierarchy is provided either by
human experts like knowledge engineers or domain-specific experts or derived
automatically by a conceptual clustering algorithm such as CLUSTER/2 in
(Michalski & Stepp 1983). The concept hierarchy consists of different levels of
concepts organized as a partial order according to a general-to-specific ordering.
The most specific concepts correspond to the specific values of attributes
and the most general concept is a null description. For example,
$\{Ph.D, M.S, B.S\} \subset U.S.\ Degree$ represents a conceptual hierarchy, which
can be shown as in Figure 2. Here $\{A_1, \ldots, A_n\} \subset B$ indicates that
$B$ is a generalization of $A_i$, for $1 \le i \le n$, that is, $A_i$ IS-A $B$.
For the sake of simplicity, this paper assumes that the conceptual hierarchies
that represent the IS-A relationships are in the form of a tree. In other words,
it assumes that there is no multiple inheritance.

           U.S. Degrees
          /     |      \
      Ph.D     M.S     B.S

Figure 2 Conceptual Hierarchy
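
A tree-shaped conceptual hierarchy can be stored as a child-to-parent mapping, which makes the no-multiple-inheritance assumption explicit because each concept has at most one parent. The sketch below is only one possible representation; the names `PARENT`, `ancestors` and `is_a` are hypothetical.

```python
# Child-to-parent mapping for the hierarchy {Ph.D, M.S, B.S} ⊂ U.S. Degree.
PARENT = {"Ph.D": "U.S. Degree", "M.S": "U.S. Degree", "B.S": "U.S. Degree"}


def ancestors(concept, parent):
    """All generalizations of a concept, from most specific to most general."""
    chain = []
    while concept in parent:
        concept = parent[concept]
        chain.append(concept)
    return chain


def is_a(specific, general, parent):
    """True if `specific` IS-A `general` in the hierarchy."""
    return specific == general or general in ancestors(specific, parent)
```

For example, `is_a("B.S", "U.S. Degree", PARENT)` holds, while the reverse direction does not.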



2 SOME MOTIVATING EXAMPLES 

This section presents two categories of examples: (1) examples that explain 
the necessity for the modification of the constraints based on the knowledge 
derived from data and (2) examples that emphasize the need to store excep- 
tional data. 



2.1 Dynamic constraints-Examples 

The following two examples illustrate the necessity to have dynamic con- 
straints in database systems rather than static constraints. These examples 
show that the constraints have to be changed based on the changes in data 
from time to time. 

1. Let us consider a University Database and assume that it maintains data
about Career development. Also assume that there is a constraint that all
students with a CS major have to take a predefined set of core courses (say
$c_1$, $c_2$ and $c_3$). This constraint can be expressed as follows:

$\forall x\, (student(x) \wedge major(x) = CS \Rightarrow course_1(x) = c_1 \wedge course_2(x) = c_2 \wedge course_3(x) = c_3)$

Suppose we derive knowledge from the data in Career development which
says that all students with a CS major who have taken a particular course $c_4$
are fully employed. Then we might want to modify the constraint by including
this course in the set of core courses. The constraint then has to be
modified as follows:

$\forall x\, (student(x) \wedge major(x) = CS \Rightarrow course_1(x) = c_1 \wedge course_2(x) = c_2 \wedge course_3(x) = c_3 \wedge course_4(x) = c_4)$

This example suggests that there is a need to modify the constraints in the
system depending on the knowledge derived from the data.
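
The effect of adding a course to the core set can be illustrated as follows. This is a simplified sketch: it models a student's courses as a set rather than as the separate attributes of the formula, and the function name and student records are hypothetical.

```python
def make_core_constraint(core_courses):
    """Build a check: every CS major must have taken all the core courses."""
    required = frozenset(core_courses)

    def check(student):
        if student["major"] != "CS":
            return True  # the implication holds trivially for other majors
        return required <= set(student["courses"])

    return check


original = make_core_constraint(["c1", "c2", "c3"])
modified = make_core_constraint(["c1", "c2", "c3", "c4"])

# Hypothetical students.
alice = {"major": "CS", "courses": ["c1", "c2", "c3"]}
bob = {"major": "EE", "courses": ["c9"]}
```

Here `alice` satisfies the original constraint but not the modified one, which shows that strengthening a constraint can turn previously valid data into violations.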

2. Consider the constraints encountered in a simple banking schema. One of
the integrity constraints that most banks observe is

balance > 0.

That means that if any customer tries to withdraw more than his/her balance,
the transaction is immediately aborted. In some cases the
bank may even fine the customer for doing so. If the customer has a very
good history and is a valuable customer to the bank, then the bank
might want to allow that customer to overdraw. To facilitate this, it has
to change the integrity constraint to either

balance > -100,

or it might have to add another attribute to the constraint, such as history,
which indicates whether he/she is a “good” customer or not, and
change the constraint to

If (history = good) then balance > -100.
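
The three variants of the banking constraint can be written as simple predicates. In this sketch it is an assumption, not stated in the text, that customers without a good history remain subject to the original constraint.

```python
def original(account):
    """The original constraint: balance > 0."""
    return account["balance"] > 0


def relaxed(account):
    """Uniform relaxation: balance > -100 for every customer."""
    return account["balance"] > -100


def conditional(account):
    """Relaxation only for good customers; others keep the original bound (assumed)."""
    limit = -100 if account["history"] == "good" else 0
    return account["balance"] > limit


# Hypothetical accounts, both overdrawn by 50.
valued = {"balance": -50, "history": "good"}
other = {"balance": -50, "history": "poor"}
```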



2.2 Exceptions in databases 

In most practical situations, there will occasionally be information which a
user wants to store, even though it contradicts the constraints defined in the
schema. A database should have the flexibility to accommodate such special
cases. Let us consider the database of an organization. Here, some examples
that are treated as exceptional information are enumerated. The examples are
from (Borgida & Williamson 1985).

1. Suppose there is a constraint that “no person should earn more than 100K.”
A few persons in the organization may nevertheless have a salary of more than
100K, and the system immediately rejects such updates. If a higher bound is
used in the constraint instead, the error-checking capability of the constraint
will be lost.

2. Let there be a domain constraint on the attribute degree of each employee:
$degree \in \{HSGD, BS, MS, PHD\}$. Some employee may have received a
foreign degree which is not specified in the domain and is not equivalent to
any American degree. It is important to store such a value in the database
as there may be some decisions based on it. On the other hand, it is
not possible to add all possible foreign degrees at the time of designing the
schema.

3. There may be a constraint which says “no employee should earn more than
his/her supervisor.” But in some special cases, a person may earn more
than the supervisor, especially when the supervisor is either on leave or a
part-time employee or if the employee has worked overtime.

4. Suppose the system encounters a person outside the U.S.A., whose address 
is not in the standard form street, city, state, zip-code but might be having 
different attributes. In such a case, it is not practical to describe all forms 
of addresses in the world in the database but still one would like to store 
such data. 

All these examples show that exceptional data is quite common in the real world
and that the database should be able to accommodate such occasional exceptions.

The next section discusses the various types of constraints encountered in 
database systems and the possible modifications that can be done to each 
type of the constraint. 



3 CONSTRAINT MODIFICATION 

Constraints in the database system are of many categories: constraints defined 
by the model and the constraints defined by the semantics of the data. In this 
context, as we are concerned with the semantics of the data, refinement of 
the constraints is due to the changed environment and circumstances of the 
real-world. To capture these changes and to suit to the new requirements, 
the constraints defined by the semantics of the data have to be changed. 
Modification of constraints is possible in either of the two ways: by relaxing 
the constraints or by strengthening them. 

Semantic constraints that are encountered in a system can be of various
kinds: (1) type constraints, (2) domain constraints, (3) value constraints, and
(4) complicated rules defined in the system expressed as logical expressions,
etc. Constraints also include the IS-A relationships as well as functional
dependencies. Type constraints declare the type of a value that can be assigned
to an attribute. Relaxing or strengthening such a constraint amounts to changing
the type to a more general type or to a more specific type, respectively. Domain
constraints specify the domain of the values each attribute can assume.
Relaxing such a constraint simply enlarges the domain, and
strengthening such a constraint restricts it. For example, a range of values
can be introduced or the length of a string can be increased in order to relax
a constraint. Value constraints can be defined over one or more attributes and
they may also contain logical expressions. A value constraint can be of the
form

<expression> <comparison operator> <value>.

The <expression> can be either an arithmetic or a logical expression.
Relaxation (strengthening) of such a constraint increases (decreases) the
range of the values satisfied by the expression, either by increasing or by
decreasing the <value>.

Finally, rules can be expressed as

if <expression> then <action>.

The <expression> here is a logical expression. Relaxing a constraint is
carried out by including a new literal in the following form:

<expression $\vee$ new literal>.

Similarly, strengthening a constraint involves inclusion of a new literal such
that

<expression $\wedge$ new literal>.

As described earlier, modification of constraints is performed in two cases. 
First, knowledge derived from the exceptional data may lead to the refinement 
of constraints. Second, new knowledge derived from the changed data, may 
lead to the modification of constraints. 

If there is modification of a constraint because of the exceptions, it obviously 
results in relaxing the constraint. After the modification, the exceptions are 
no longer considered as exceptions but are considered as normal data. 

However, if there is a modification of a constraint due to the knowledge 
derived from the new data, this modification might be either to relax the 
constraint or to strengthen it. 
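
Relaxation by disjunction and strengthening by conjunction can be sketched as higher-order functions over predicates. The helper names and the two example literals (a placeholder foreign degree value and a required major) are hypothetical and serve only to illustrate the two directions of modification.

```python
def relax(expression, new_literal):
    """Relaxed constraint: expression OR new literal."""
    return lambda row: expression(row) or new_literal(row)


def strengthen(expression, new_literal):
    """Strengthened constraint: expression AND new literal."""
    return lambda row: expression(row) and new_literal(row)


def in_domain(row):
    """Domain constraint on degree, as in the earlier example."""
    return row["degree"] in {"HSGD", "BS", "MS", "PHD"}


def is_foreign(row):
    """Hypothetical literal admitting a placeholder value for foreign degrees."""
    return row["degree"] == "FOREIGN"


def has_major(row):
    """Hypothetical literal requiring that a major is recorded."""
    return row.get("major") is not None


relaxed = relax(in_domain, is_foreign)
strengthened = strengthen(in_domain, has_major)
```

Every row accepted by `in_domain` is also accepted by `relaxed`, and every row accepted by `strengthened` is also accepted by `in_domain`, which is the sense in which the two operations enlarge or shrink the set of admissible data.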

While refining the constraints, the consistency of the constraint-base has to 
be maintained. That means, no new constraint should invalidate an already 
existing constraint. In other words, the system of constraints should not yield 
a logical contradiction. While this paper does not address the issue of resolving 
such conflicts, interested readers may refer to (Yoon & Kerschberg 1993). 

4 THE KNOWLEDGE DISCOVERY MECHANISM 

This section describes a mechanism for discovering knowledge from data stored in
a relational database. It is an attribute-oriented, tree-ascending generalization
technique, which adopts the AI learning technique of “learning from
examples.” In order to derive knowledge from data, one has to specify the task.
The steps involved in the knowledge discovery process are as follows:

1. The first step is to select the task-relevant data. This can be extracted by
performing selection, projection and join operations on the relevant relations.
In this process only the attributes relevant to the task have to be selected.

2. This step involves an attribute-oriented induction process, which generalizes
each attribute value in the table to a higher-level concept. Hence,
generalization is performed by climbing the conceptual hierarchy. However, if
there is a large set of distinct values for a particular attribute, then that
attribute has to be dropped. For example, the key attributes have to be
dropped as they are distinct for each tuple and do not contain any knowledge.
The generalization can be performed until a pre-specified threshold
number of tuples is reached. The resulting relation is called the generalized
relation.

3. This step involves simplification of the generalized relation. If several tuples
contain the same attribute values except one, then they can be reduced to
one by taking the distinct values of this attribute as a set.

4. This step transforms the generalized relation into a logic formula. Each tuple
is represented as a logic formula in conjunctive normal form. If multiple
tuples exist in the simplified generalized relation, they are represented as
disjunctions of several conjunctive normal logic formulae.

This algorithm assumes that the conceptual hierarchy is in the form of a 
tree. But if there is multiple inheritance, there will be more than one higher- 
level concept. Therefore, the algorithm has to select the most relevant higher- 
level concept from the set of generalizations. 
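
Steps 3 and 4 above can be sketched as follows. This is an illustrative reading of the description rather than the algorithm of the cited work; the function names and the small generalized relation (including the Math major) are made up for the example.

```python
def simplify(rows, attrs):
    """Merge tuples that agree on all attributes but one, collecting that attribute's values."""
    rows = [{a: frozenset([r[a]]) for a in attrs} for r in rows]
    merged = True
    while merged:
        merged = False
        for i in range(len(rows)):
            for j in range(i + 1, len(rows)):
                differing = [a for a in attrs if rows[i][a] != rows[j][a]]
                if len(differing) <= 1:
                    for a in differing:
                        rows[i][a] = rows[i][a] | rows[j][a]
                    del rows[j]
                    merged = True
                    break
            if merged:
                break
    return rows


def to_formula(rows, attrs):
    """Render each tuple as a conjunction and join the tuples by disjunction."""
    def conjunct(row):
        parts = []
        for a in attrs:
            values = sorted(row[a])
            if len(values) == 1:
                parts.append(f"{a}(x) = {values[0]}")
            else:
                parts.append(f"{a}(x) ∈ {{{', '.join(values)}}}")
        return "(" + " ∧ ".join(parts) + ")"
    return " ∨ ".join(conjunct(r) for r in rows)


# Hypothetical generalized relation.
ATTRS = ["Major", "Course1", "Course2"]
generalized = [
    {"Major": "CS", "Course1": "EE", "Course2": "DB"},
    {"Major": "CS", "Course1": "EE", "Course2": "Software"},
    {"Major": "Math", "Course1": "IS", "Course2": "EE"},
]
simplified = simplify(generalized, ATTRS)
rule = to_formula(simplified, ATTRS)
```

The first two tuples differ only in Course2 and are merged into one tuple whose Course2 value is a set; the third tuple differs in every attribute and remains a separate disjunct.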



5 MODIFICATION OF CONSTRAINTS BY LEARNING FROM 
CHANGES IN DATA 

This section discusses the modification of constraints based on the knowledge 
derived from the data which changes from time to time. This has to be accom- 
plished in two phases. The first phase is to derive knowledge from the data 
and the second phase is to refine the constraints, if required, by integrating 
the derived knowledge into the constraint-base. 



5.1 Knowledge Discovery from data

To discover knowledge from data, the attribute-oriented, tree-ascending
generalization technique presented in Section 4 is employed. The generalization
is carried out basically by substituting the attribute value of a lower-level
concept with the corresponding higher-level concept. In order to derive knowledge,
one has to specify the target of the derivation. From the specified
target, the target-class tuples are extracted from the data. The process of
generalization is performed only on the target class. In this process, different
tuples may be generalized to the same concept, which results in a reduced
number of tuples. This small number of tuples can be transformed into a
logic formula, which describes the rule derived from this target class. The
knowledge discovery process is illustrated with the following example.

Consider a simple University Database. Assume that it maintains data
about Career Development and contains two relations, “Course-history” and
“Employment”, as shown below.

Course-history

Sname   Major   Course_1    Course_2
John    CS      CompArch    ExpertDB
Tom     CS      Networks    DistriDB
Paul    CS      Networks    IntroSE
Sam     CS      Networks    SWDesign
Ram     CS      CompArch    IntroDB

Employment

Sname   Status
John    employed
Tom     employed
Paul    unemployed
Sam     unemployed
Ram     employed

Figure 3 Concept Hierarchy

Let there be the following constraint on the students with a “CS” major, which
mandates that all students with a CS major have to take two courses outside their
major area.

$\forall x\, (Student(x) \wedge major(x) = CS \Rightarrow Course_1(x) \in EE \wedge Course_2(x) \in IS)$

The concept hierarchy relevant to this example, also shown in Figure 3, is as
follows:

$\{IntroDB, DistriDB, ExpertDB\} \subset DB$
$\{IntroSE, SWDesign\} \subset Software$
$\{DB, Software\} \subset IS$
$\{CompArch, CompTech, Networks\} \subset EE$

First, the user specifies the target class, which here is the class of
previous students who are employed, say “employed-students.” A join
operation is then performed to derive the required data, and the target class
is extracted from the derived relation. The resulting relation is shown
below, with the tuples of the target class separated from the rest of
the data.

Course-history $\bowtie$ Career-development

Sname   Major   Course_1    Course_2    Status
John    CS      CompArch    ExpertDB    employed
Tom     CS      Networks    DistriDB    employed
Ram     CS      CompArch    IntroDB     employed
------------------------------------------------
Paul    CS      Networks    IntroSE     unemployed
Sam     CS      Networks    IntroSE     unemployed


Now generalization is performed on this data, and the rule or rules that
characterize the target class are derived using the tree-ascending process.
The generalization results in the following tuple.



Generalized-relation

Major   Course1   Course2   Status
CS      EE        DB        employed



During the induction process, the key attribute “Sname” is dropped, as it
does not contain any knowledge. Therefore, this tuple can be logically
expressed as

$$\forall (x)\; \text{employed-students}(x) \Rightarrow \mathrm{Course}_1(x) \in \mathrm{EE} \wedge \mathrm{Course}_2(x) \in \mathrm{DB}$$

which is the knowledge discovered in the form of a rule representing
the target class.
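
The tree-ascending step above can be sketched in a few lines of Python. This
is only an illustration under stated assumptions, not the authors'
implementation: the names PARENT and generalize are hypothetical, the
threshold of one distinct value per attribute is assumed here, and the key
attribute Sname is taken to have been dropped already.

```python
# Concept hierarchy from the example: child concept -> parent concept.
PARENT = {
    "IntroDB": "DB", "DistriDB": "DB", "ExpertDB": "DB",
    "IntroSE": "Software", "SWDesign": "Software",
    "DB": "IS", "Software": "IS",
    "CompArch": "EE", "CompTech": "EE", "Networks": "EE",
}

def generalize(tuples, attributes, threshold=1):
    """Climb the hierarchy, attribute by attribute, until at most
    `threshold` distinct values remain or nothing can be lifted further."""
    rows = [dict(t) for t in tuples]
    for attr in attributes:
        while len({r[attr] for r in rows}) > threshold:
            lifted = [PARENT.get(r[attr], r[attr]) for r in rows]
            if lifted == [r[attr] for r in rows]:
                break  # top of the hierarchy reached for this attribute
            for r, value in zip(rows, lifted):
                r[attr] = value
    # Identical generalized tuples are merged into one.
    merged = []
    for r in rows:
        if r not in merged:
            merged.append(r)
    return merged

# Target class: the employed students, with the key attribute Sname removed.
employed = [
    {"Major": "CS", "Course1": "CompArch", "Course2": "ExpertDB", "Status": "employed"},
    {"Major": "CS", "Course1": "Networks", "Course2": "DistriDB", "Status": "employed"},
    {"Major": "CS", "Course1": "CompArch", "Course2": "IntroDB", "Status": "employed"},
]

rule = generalize(employed, ["Major", "Course1", "Course2", "Status"])
print(rule)
```

Under these assumptions the three employed tuples collapse into the single
generalized tuple (CS, EE, DB, employed) given in the text.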

The knowledge discovery methodology described here is similar to that in
(Cai et al. 1991). At this point, the next step is to make use of this
derived knowledge to refine our constraints, if necessary. The system has to
determine whether the derived knowledge is relevant for modifying the
constraints and, if so, it also has to integrate this knowledge into the
previous schema and refine only the constraints that are relevant. If any
modification to the constraint-base is required, the system has to signal
this and output those changes so that the system administrator can take a
final decision on materializing them. The following subsection discusses
these aspects.

5.2 Constraint Refinement

There are two steps involved in performing the task of constraint refinement. 
The first step is to find the set of constraints that needs modification depend- 
ing on the discovered knowledge. The second step is to integrate the discovered 
rule into the system. 

1. Regarding the first task of determining the relevant set of constraints,
there are two ways of doing it.

(a) The first approach is to attach reasons to each constraint. These reasons
describe why the constraint exists. In other words, one has to reason
about the constraints. These rules have to be stored in the knowledge
base. If we return to the constraint on the course requirement in the
example of the previous subsection, these reasons could be (i) employment,
(ii) a state-stipulated rule, (iii) school policy, (iv) the department
chairman’s decision, etc. The system then compares the target with the
reasons attached to each constraint by scanning through the reason-set
and picks the set of relevant constraints that match the target.

(b) The second approach is a brute-force method that compares the attributes
in the derived rule with those in each of the constraints. Whenever a
constraint contains any of the attributes in the rule, that constraint
has to be included in the set of relevant constraints.



2. The next task is to modify the set of relevant constraints. This can be done
by simply replacing the right-hand side of the relevant constraints by that
of the derived rule. In our example, the constraint is modified as

$$\forall (x)\; (\mathrm{Student}(x) \wedge \mathrm{major}(x) = \mathrm{CS}) \Rightarrow (\mathrm{Course}_1(x) \in \mathrm{EE} \wedge \mathrm{Course}_2(x) \in \mathrm{DB})$$
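
The brute-force selection of approach (b) and the replacement of the
right-hand side can be illustrated as follows. This is a minimal sketch under
assumptions: the Rule class, the helper names, and the second constraint
(gpa-range) are hypothetical and serve only to show one constraint being
selected and another being skipped.

```python
from dataclasses import dataclass, replace

@dataclass
class Rule:
    lhs: dict  # attribute -> value required by the antecedent
    rhs: dict  # attribute -> concept required by the consequent

def attributes(rule):
    """All attributes mentioned anywhere in a rule or constraint."""
    return set(rule.lhs) | set(rule.rhs)

def relevant_constraints(constraints, derived):
    """Approach (b): keep every constraint sharing an attribute with the rule."""
    wanted = attributes(derived)
    return [name for name, c in constraints.items() if attributes(c) & wanted]

def suggest(constraint, derived):
    """Step 2: replace the right-hand side by that of the derived rule."""
    return replace(constraint, rhs=dict(derived.rhs))

constraints = {
    "cs-outside-courses": Rule({"major": "CS"}, {"Course1": "EE", "Course2": "IS"}),
    "gpa-range": Rule({}, {"GPA": "0..4"}),  # hypothetical, unrelated constraint
}
derived = Rule({"Status": "employed"}, {"Course1": "EE", "Course2": "DB"})

hits = relevant_constraints(constraints, derived)
suggestions = {name: suggest(constraints[name], derived) for name in hits}
print(hits, suggestions)
```

The suggestions are kept apart from the constraint-base itself, in line with
the requirement that the administrator takes the final decision.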



6 MODIFICATION OF CONSTRAINTS BY LEARNING FROM 
EXCEPTIONS 

Once the system starts accepting exceptional data, a number of problems crop
up. In order to store the exceptional data, many modifications to the data as
well as to the constraints have to be made. If the exceptional data accumulates
to a large quantity, then there is a need to modify the constraints. In this
section, mechanisms to handle exceptions and to modify the constraints
are discussed (originally proposed in (Borgida & Williamson 1985)).

6.1 Handling Exceptions

This subsection explains the problems encountered in handling exceptions and 
the methods to resolve them. The problems that are usually encountered by 
allowing exceptions are the following: 



1. How to store the exceptional data. 

2. What to do if some other user accesses the exceptional data. 

3. How to continue checking for future violations without causing any false
alarms due to the past violations.



Basically, new data enters the database by one of the following operations:
create, insert, modify. If there is a violation of an integrity constraint,
the system usually responds by signaling a violation. If the user still
insists on storing that fact, the system allows it and stores that
instance as an exception. This exceptional data can be entered by a special
set of operators: exnal-create, exnal-modify, exnal-insert. This data is
stored in logically separate files.

Similarly, if any user wants to retrieve this exceptional data, he/she will
be cautioned that it is exceptional data. It can be retrieved only by
operations such as exnal-retrieve. Thus the user can determine whether
normal procedures apply in such circumstances or whether some special
actions have to be taken.

If there is a violation of a constraint, the system continues to signal alarms
as long as the violating data exists. The constraint is then inconsistent
with the database, and the system cannot distinguish later violations from
false alarms caused by the old exception. The system therefore has to avoid
these false alarms while still detecting future violations, and it must also
handle the situations in which exceptional data are accessed. To achieve
this, all constraints in the system are modified into the following form:

$$\text{e-constraint}_i:\quad \forall (x)\; \mathrm{SPECIALconstraint}_i(x) \vee \mathrm{constraint}_i(x),$$

where $x$ is a sequence of variables, $\mathrm{constraint}_i$ is the original
form of the constraint, and $\mathrm{SPECIALconstraint}_i(x)$ is a predicate
that prevents the actual condition from being evaluated for exceptional
cases. Initially, $\mathrm{SPECIALconstraint}_i(x)$ is false, but as
exceptions are encountered and excused for various argument tuples
$t_1, t_2, \ldots$, $\mathrm{SPECIALconstraint}_i(x)$ becomes

$$\mathrm{SPECIALconstraint}_i(x) \Leftrightarrow x = t_1 \vee x = t_2 \vee \ldots$$
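
The excuse mechanism can be sketched as a constraint wrapped together with
its set of excused tuples. The class and method names below are hypothetical,
and the degree constraint used in the next subsection serves as the sample
predicate; this illustrates the idea rather than the original system.

```python
class ExcusableConstraint:
    """e-constraint: SPECIALconstraint(x) OR constraint(x)."""

    def __init__(self, predicate):
        self.predicate = predicate  # the original constraint
        self.excused = set()        # argument tuples t1, t2, ...

    def special(self, x):
        # Initially false for every x; true only for excused tuples.
        return x in self.excused

    def holds(self, x):
        return self.special(x) or self.predicate(x)

    def excuse(self, x):
        self.excused.add(x)

ALLOWED = {"Ph.D", "M.S", "B.S"}
degree_ok = ExcusableConstraint(lambda t: t[3] in ALLOWED)

ravi = ("Ravi", "CS", "GTA", "M.Tech", "India")
ahmed = ("Ahmed", "CS", "GTA", "M.Tech", "Pakistan")

before = degree_ok.holds(ravi)  # violation is signaled
degree_ok.excuse(ravi)          # the user insists; the tuple is excused
after = degree_ok.holds(ravi)   # no further false alarm for this tuple
later = degree_ok.holds(ahmed)  # a new violation is still detected
print(before, after, later)
```

Because the excused set only ever names specific tuples, the original
predicate keeps being evaluated for every other tuple.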

In the following subsection, a technique for characterizing the class of
exceptions and thereby refining constraints is presented.

Figure 4 Exceptional and Normal Data

6.2 Constraint Modification 

Constraint modification from exceptions always results in relaxing the
constraint. Relaxation of a constraint is done in such a way that the
previously known exceptional data is now considered normal data. Figure 4(a)
shows the normal data (n) and the exceptional data (e). After the modification
of the constraint, all the exceptional data are considered normal, which is
shown in Figure 4(b).

Relaxing a constraint can be done either as described in section 3 or by
having the constraint checked only in certain restricted circumstances. To
modify the constraints based on the knowledge derived from exceptions, the
attribute-oriented generalization technique discussed in section 4, which
makes a specific-to-general search, is employed. The process of modifying
constraints is explained with an example. Assume that the specified threshold
level is “one.”

Consider once again the University Database and assume that it contains a
relation Teaching-Staff that stores information about all its teaching staff.
Also assume that there exists a domain constraint on the attribute “TDegree,”
which is as follows:

$$(\mathrm{TDegree} \Rightarrow \{\text{Ph.D}, \text{M.S}, \text{B.S}\}) \vee \mathrm{SPECIALconstraint}_{\mathrm{TDegree}}(x)$$

As mentioned earlier, the constraint has already been modified to accommodate
exceptions by adding $\mathrm{SPECIALconstraint}_{\mathrm{TDegree}}$ to the
actual constraint.

The concept hierarchy relevant to this example, also shown in Figure 5, is
as follows:

$\{\text{Ph.D}, \text{M.S}, \text{B.S}\} \subset \text{U.S.Degree}$

$\{\text{India}, \text{Pakistan}\} \subset \text{Asia}$

Figure 5 Conceptual Hierarchy (U.S. Degrees: Ph.D, M.S, B.S; Asia: India,
Pakistan)

Let us assume that the relation Faculty is as shown in the following table.

Faculty

Tname   Dept   Cadre   TDegree   Country
John    CS     GTA     Ph.D      U.S.A.
Tom     CS     GTA     M.S       U.S.A.
Paul    CS     GTA     B.S       U.S.A.


Now if a new tuple

Exnal-Faculty

Tname   Dept   Cadre   TDegree   Country
Ravi    CS     GTA     M.Tech    India

enters the database and violates the constraint stated above, then it is
stored as an exception. Suppose another exceptional tuple, given below,
enters the database.

Exnal-Faculty

Tname   Dept   Cadre   TDegree   Country
Ahmed   CS     GTA     M.Tech    Pakistan

Then the generalization algorithm converts this into the generalized relation 
as shown below. (The exceptional data is considered as the task-relevant data.) 



Generalized-relation

Dept   Cadre   TDegree   Country
CS     GTA     M.Tech    Asia



As we can see, in the induction process the attribute “Tname” is dropped, as
it does not contain any information. (Usually the key attributes are
eliminated, as they are distinct for each tuple and do not contribute to the
discovery of any knowledge.) In this process, the predicate
$\mathrm{SPECIALconstraint}_{\mathrm{TDegree}}(x)$ is also modified as

$$\mathrm{SPECIALconstraint}_{\mathrm{TDegree}}(x) \Leftrightarrow x = \text{Ravi} \vee x = \text{Ahmed}$$

If such exceptional information mounts to some threshold value, the system
signals a modification to the constraints. As the system already stores the
relevant constraint violated by this exceptional data, that particular
constraint is modified. As mentioned earlier, modification can be done in two
ways.

The constraint can be generalized by enlarging the domain of the attribute
“TDegree.” The modified constraint is then as shown below.

$$(\mathrm{TDegree} \Rightarrow \{\text{Ph.D}, \text{M.S}, \text{B.S}, \text{M.Tech}\}) \vee \mathrm{SPECIALconstraint}_{\mathrm{TDegree}}(x)$$

The predicate $\mathrm{SPECIALconstraint}_{\mathrm{TDegree}}(x)$ is still
there to accommodate further exceptions.

The second approach to modifying the constraint is to restrict the
circumstances in which it is checked. In other words, make the system check
this constraint unless “Country ∈ Asia.” This can be done as follows. From the
above table, the following rule can be derived:

$$\forall (x)\; \text{Exnal-Faculty}(x) \Rightarrow (\mathrm{Dept} = \mathrm{CS} \wedge \mathrm{Cadre} = \mathrm{GTA} \wedge \mathrm{TDegree} = \text{M.Tech} \wedge \mathrm{Country} \in \mathrm{Asia})$$

This is compared with the rule derived from the normal data, which can be
derived in the same way as that of the exceptional data using the
generalization technique. The generalized relation is shown below.



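Generalized-relation

Dept   Cadre   TDegree       Country
CS     GTA     U.S. Degree   U.S.A.

$$\forall (x)\; \mathrm{Faculty}(x) \Rightarrow (\mathrm{Dept} = \mathrm{CS} \wedge \mathrm{Cadre} = \mathrm{GTA} \wedge \mathrm{TDegree} \in \text{U.S. Degree} \wedge \mathrm{Country} = \text{U.S.A.})$$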

Then the attributes that have different values in these two rules are
identified with respect to the constraint on the attribute “TDegree.” The
tuples being compared are therefore the following.



Generalized-relation

Dept   Cadre   TDegree       Country
CS     GTA     M.Tech        Asia
CS     GTA     U.S. Degree   U.S.A.



Therefore, it results in “Country ∈ Asia.” Hence, the constraint is modified
as

$$(\mathrm{TDegree} \Rightarrow \{\text{Ph.D}, \text{M.S}, \text{B.S}\}) \vee \mathrm{SPECIALconstraint}_{\mathrm{TDegree}}(x) \vee \mathrm{Country} \in \mathrm{Asia}$$

Whenever a modification to a constraint is to be performed, the system
outputs the suggested modification. The actual modification is made by the
database administrator. The same approach can be employed in the case of a
simple value constraint or of a constraint represented as a first-order
logical expression.
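
The two relaxation strategies of this subsection can be compared side by side
in a short sketch. The function names and the extra tuple for “Lee” are
hypothetical and introduced only for illustration; the generalized tuples are
the ones given above.

```python
ALLOWED = {"Ph.D", "M.S", "B.S"}
ASIA = {"India", "Pakistan"}

def differing(exceptional, normal, constrained):
    """Attributes, other than the constrained one, on which the rules differ."""
    return {a: v for a, v in exceptional.items()
            if a != constrained and normal[a] != v}

exc_rule = {"Dept": "CS", "Cadre": "GTA", "TDegree": "M.Tech", "Country": "Asia"}
norm_rule = {"Dept": "CS", "Cadre": "GTA", "TDegree": "U.S. Degree", "Country": "U.S.A."}
condition = differing(exc_rule, norm_rule, "TDegree")

def widened_ok(row, excused=()):
    # First approach: enlarge the domain of TDegree by M.Tech.
    return row["TDegree"] in (ALLOWED | {"M.Tech"}) or row["Tname"] in excused

def restricted_ok(row, excused=()):
    # Second approach: do not check the constraint when Country is in Asia.
    return (row["TDegree"] in ALLOWED or row["Tname"] in excused
            or row["Country"] in ASIA)

ravi = {"Tname": "Ravi", "Dept": "CS", "Cadre": "GTA",
        "TDegree": "M.Tech", "Country": "India"}
# Hypothetical tuple: an M.Tech holder whose country is not in Asia.
lee = {"Tname": "Lee", "Dept": "CS", "Cadre": "GTA",
       "TDegree": "M.Tech", "Country": "U.S.A."}

print(condition, widened_ok(lee), restricted_ok(lee), restricted_ok(ravi))
```

The two suggestions differ in what they admit: enlarging the domain accepts
M.Tech from any country, whereas restricting the check accepts any degree
from a country in Asia. This difference is one reason the final choice is
left to the database administrator.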



7 CONCLUSION 

In order to capture real-world situations and to fully serve the users’
needs, constraint modification is required in many systems. As the data
entering the system is ever-changing, the refinement of constraints is
required in order to suit the environment. When the data changes, the changes
may be such that the constraint-base has to be modified. Also, if a large
amount of exceptional data accumulates, refinement of constraints is
required. The modifications in both these cases are performed based on the
knowledge derived from the data and from the exceptions. The derived knowledge
is in the form of rules and is derived using a generalization technique based
on attribute-oriented, specific-to-general search. The derived rule is
integrated into the constraint-base and refines the relevant set of
constraints either by relaxing them or by strengthening them.

In this approach, the system only suggests the required modifications but
does not modify the constraints on its own. Therefore, a human being, usually
the database administrator, is involved in the evolution process of the
constraint-base. It is the job of the database administrator to decide which
modifications have to be incorporated from the set of suggestions for each
constraint. This makes the modification more tractable. Thus no automatic
evolution process is employed here.

Our approach to the modification of constraints due to changes in normal
data, however, is limited to those constraints that represent the
characteristic rules of the data. More research is needed to modify more
general constraints, including those that require comparison of more than one
attribute and those that involve aggregate predicates. In the refinement
process of the constraints discussed in section 5.2, finding the relevant set
of constraints that need modification requires scanning the entire set of
reasons attached to the constraints. This is quite an expensive process,
especially when the set of rules attached to each constraint is very large.
In such a case, one has to employ an optimization technique that identifies
the relevant set of constraints more efficiently.

Performing modifications to the constraint-base might yield a logical
contradiction. In order to maintain the consistency of the constraint base,
the system should check for such conflicts and resolve them before
materializing any modifications.

This paper assumes that the conceptual hierarchy is in the form of a tree.
In real situations, however, there may be some concepts with multiple
inheritance. The techniques have to be modified, as suggested in section 4,
to suit such more general systems.

Abstract

Today many organizations are facing the need to integrate multiple, previously 
independent but semantically related information sources. Although often the 
quality and integrity of the data at a single information source itself is a 
problem, issues and problems regarding the quality and integrity of integrated 
data have been mainly neglected by approaches to data integration. As recent 
studies show, this raises severe problems for global applications that depend 
on the reliability of the integrated data. 

In this paper we focus on problems and possible solutions for modeling and 
managing data quality and integrity of integrated data. For this, we propose 
a taxonomy of data quality aspects that includes important attributes such as 
timeliness and completeness of local information sources. We use the federated 
database approach to describe how data quality aspects can be modeled as 
metadata during database integration. These metadata are employed to spec- 
ify data quality related query goals for global applications and to dynamically 
integrate data from component databases storing data of different quality. 
The presented concepts are based on treating data quality and integrity as a 
first-class concept in both metadata model and global query language. 



Keywords

1 INTRODUCTION

In order to stay competitive and to rely on accurate and complete
information, many enterprises and federal agencies are facing the need to
integrate several previously independent but semantically related information
sources into globally accessible systems. Integration approaches are typically
based on multidatabase or federated database systems (see, e.g., (Sheth and
Larson 1990, Bright et al. 1992, Kim 1995)). Applications built on top of such
systems include data warehouses, decision support systems, hospital
information systems, and environmental information systems.

Recent studies and reports show that these applications, in particular data
warehouses, often experience several problems with regard to the reliability
of the integrated data (Kimball 1996, Bischoff and Alexander 1997, Jarke and
Vassiliou 1997). The main reason is that the component databases
participating in the federation often already contain incorrect and poor
quality data. The quality and integrity of the integrated data then becomes
even worse unless suitable methods and techniques for modeling and managing
these issues are employed.

During the past decade a significant amount of literature on database
integration has been published. Most of the proposed approaches focus on
the problem of resolving structural and semantic conflicts among
(heterogeneous) component databases. For an overview of the problems and
proposed solutions see, e.g., (Kim and Seo 1991, Spaccapietra et al 1992,
Sheth and Kashyap 1993, Kim et al. 1995). Only a minor part of the literature
discusses how to handle integrity constraints in data integration, mainly in
connection with resolving structural and semantic conflicts, e.g., (Reddy et
al 1995, Vermeer and Apers 1996, Conrad et al 1997).

In this paper we claim that, from a practical point of view, the traditional
notion of data integrity, i.e., the semantic correctness of stored data, plays
only a minor role with regard to the reliability of integrated data. Typical
problems that occur at the integration level and which cannot be prevented
by traditional integrity maintaining methods are data quality problems rather
than data integrity problems. For example, the aspect of outdated or
expired data at the integration level is often referred to as a data integrity
problem. But there exists neither a formalism nor a technique to prevent the
integration of outdated data. The same holds for the accuracy or completeness
of data at the integration level. Again, there does not exist a data integrity
concept that covers these aspects during database integration.

There are two key issues, or rather assumptions, in database integration
approaches that contribute to poor quality and integrity of data at the
integration level. First, existing approaches assume that the data stored
at component databases (CDBs) somehow have the same high quality and
correctness. Considering individual CDBs this is true, because the data at
each site are sufficient for local applications. For example, while outdated
data may be sufficient for a local application at one CDB, another CDB must
always contain up-to-date data. Similar scenarios can be given concerning the
completeness and accuracy of data, aspects which are also not considered by
existing data integration approaches. Thus there is a strong need for modeling
these aspects as well.

But such a modeling approach leads to the next problem. Existing data
integration techniques assume that schematic and semantic conflicts can always
be resolved statically by giving unique schemas for global relations and
associated data integration rules. While this often is true for schematic
aspects, the data stored at different component databases typically change
with time. For example, a component database may only have up-to-date data on
certain days of the week. We thus cannot always integrate data (of different
quality and integrity) in a static manner by using a fixed set of data
integration rules.

The above aspects raise the question of how we can model, manage, and
represent different data quality and integrity aspects during database
integration. In this paper, we describe our observations and preliminary
results on studying different aspects of data quality and integrity in
federated databases. The main idea is to treat data quality as a first-class
concept in data integration approaches and in querying integrated data. Based
on a taxonomy of (time-varying) data quality aspects, these aspects can be
modeled suitably as metadata at the integration level. For global
applications, the designer can specify data quality goals for global queries
used in applications, and the global query processor suitably decomposes
global queries into subqueries such that the retrieved data all have the same
quality. This ensures that data of different quality are not combined or
joined. Information about the quality of retrieved data is represented at the
integration level as well, thus providing global users with a suitable means
to cope with incorrect and poor quality data.

The paper is organized as follows: In Section 2 we outline the main concepts
and techniques of the federated database scenario that are important
for the presented approach. In Section 3 we motivate the distinction between
data quality and data integrity and, based on a taxonomy of data quality, we
show how data quality aspects can be modeled as metadata during database
integration. The usage of the metadata, which also include information about
local integrity constraints, is discussed in Section 4. Finally, concluding
remarks and future work are presented in Section 5.

2 THE FEDERATED DATABASE APPROACH 

For our approach to managing data quality and integrity in a multidatabase
environment, we adopt the schema and system architecture for federated
database systems described in, e.g., (Sheth and Larson 1990). That is, we
assume a global (or integrated) database schema that provides users and
applications with an integrated view of all local database schemas (or export
schemas). For our discussions, we use the relational data model as the global
data model because of its simple structure. The presented approach can easily
be adapted to an object-oriented model.

The basic component of a federated database system is the federation
layer, also called the integration level (see also Figure 1). Depending on
the type of global applications, this layer can itself be a database system
providing a global data model for schema integration, a metadata repository
capturing the information related to the integration process, and a global
query language and processor. The query processor utilizes the metadata to map
queries against the global schema to (sub)queries against the component
databases (CDBs), and is discussed in more detail in Section 4. If global
applications are read-only applications such as data warehouses, then the
above components are sufficient to build a federation layer. Data
modifications through global applications additionally require a global
transaction manager, e.g., (Breitbart et al 1995), which will not be
considered in this paper.




Figure 1 Schema and System Architecture of a Federated Database System 

A major task in integrating data from different autonomous and possibly
heterogeneous database systems is to detect correspondences among the
elements of component schemas and to resolve possible conflicts (Batini et
al. 1986, Sheth and Larson 1990, Kim et al. 1995). Conflicts arise due to
various kinds of heterogeneity among the databases participating in a
federation, ranging from technological discrepancies (software, hardware) to
schematic and semantic heterogeneity. While the former type of conflict can
often be resolved using suitable protocols and wrappers, the latter type of
conflict is much more difficult to cope with and has been a major concern
among the database research community (Bright et al. 1992, Litwin et al.
1990, Sheth and Kashyap 1993, Kim et al. 1995).

Schematic and semantic heterogeneity stems from the ability of component
databases to choose their own design, including data model, query language,
naming of data (relations and attributes), and in particular semantic
interpretations of the data and attribute values. During the past decade a
significant amount of literature discussing methods for resolving schematic
and semantic conflicts has been published. These methods focus on naming
conflicts (homonyms and synonyms) and in particular on semantic data conflicts
(modeling the same information in different ways) (Kim and Seo 1991,
Rusinkiewicz and Missier 1995, Spaccapietra et al. 1992).

Essential for our approach to managing data quality and integrity in federated
databases is the fact that almost all approaches to resolving schematic and
semantic conflicts are based on static conflict resolution and in particular
data integration rules. Once semantic conflicts among data stored in component
databases have been detected, they are resolved by specifying suitable
conflict resolution rules or views in the metadata repository. Such rules
include relation and attribute renaming, joining local relations, and data
conversions between different scales and measurements. After restructuring of
local relations at the federation layer, component schemas are in a form that
can easily be integrated by, e.g., building the union of the restructured
local relations that contain similar or related data. These views then build
the elements of the global database schema, and only these views are visible
to global users and applications. For decomposing global queries against these
views, the global query processor employs data integration and conflict
resolution rules which are recorded in the metadata repository.
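
To make the notion of static conflict resolution concrete, the following
sketch applies fixed renaming and conversion rules to two local relations and
then builds their union. All relation contents, attribute names, and units
are hypothetical and chosen only for illustration; they are not taken from
the paper.

```python
def restructure(rows, rename, convert):
    """Apply static rules: rename attributes, then convert scales."""
    result = []
    for row in rows:
        new_row = {rename.get(attr, attr): value for attr, value in row.items()}
        for attr, conversion in convert.items():
            new_row[attr] = conversion(new_row[attr])
        result.append(new_row)
    return result

# Hypothetical local relations with naming and scale conflicts.
cdb1 = [{"region": "North", "quantity_kg": 1200.0}]
cdb2 = [{"area": "South", "qty_t": 2.5}]

# Rule set for CDB1: only an attribute renaming is needed.
view1 = restructure(cdb1, {"quantity_kg": "quantity"}, {})
# Rule set for CDB2: renaming plus a conversion from tons to kilograms.
view2 = restructure(cdb2,
                    {"area": "region", "qty_t": "quantity"},
                    {"quantity": lambda tons: tons * 1000.0})

# The global relation is the union of the restructured local relations.
global_relation = view1 + view2
print(global_relation)
```

The rules are fixed once and applied identically to every query, which is
exactly the static character that the remainder of the paper calls into
question.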

3 DATA QUALITY AND INTEGRITY IN DATABASE 
INTEGRATION 

In this section we investigate the distinction between the notions of data
quality and data integrity, and we examine how the underlying concepts
influence traditional database integration approaches. As we will show in
Section 3.1, many aspects often referred to as data integrity are basically
data quality aspects. In Section 3.2 we show how to identify and describe
different data quality aspects during database integration. It turns out that
neglecting these aspects can have a major impact on the reliability of the
integrated data. Finally, in Section 3.3 we describe how the proposed data
quality and integrity aspects can be modeled as integration metadata, thus
allowing designers of global applications and the query processor to treat
these aspects as first-class concepts at the integration level.

3.1 From Data Integrity to Data Quality 

As information is becoming a key organizational resource, many companies 
and federal agencies are concerned about the quality of the data they manage 
and use for their applications (Redman 1996). In the past years a significant 
amount of literature focusing on data quality has been published, ranging from 
defining frameworks of data quality aspects (Wang and Strong 1996) to data 
quality improvement strategies, the so-called Total Data Quality Management 
(Wang 1998). 

The essential problem in data quality management is the lack of precise
definitions for data quality, which makes it difficult to (automatically)
prevent or detect poor quality data. This seems not to be a problem for
data integrity, where the semantic correctness of the data stored in a
database is specified by means of some logic-based formulas, e.g., the tuple
relational calculus. Based on such specifications, integrity maintaining
mechanisms such as triggers can often easily be derived (Grefen and Apers
1993, Ceri and Widom 1990).

In our opinion, the main difference between data quality and the traditional 
notion of data integrity is that data integrity concerns only the data stored 
in the database. In contrast, data quality additionally relates the data stored 
in the database with its “fitness for use”. That is, data quality also refers to 
applications and the portion of the real world modeled in the database. In 
fact, many important requirements we often refer to as data integrity cannot 
simply be specified by integrity constraints over a given database schema. For 
example, how to model (and maintain) the requirement that the data stored 
in a database are always complete with regard to the information we have 
in the real world? There exists no mechanism that forces the user, e.g., in a 
payroll office, to enter all information about all employees. In this case, tra- 
ditional integrity constraints can only address the semantic correctness of the 
information stored about employees. Typical integrity constraints focusing on 
the completeness of information either require well defined attribute values 
for data records (thus preventing null values) or they describe foreign key con- 
ditions, thus assuming that already some portion of the relevant information 
has been stored. 
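
The limits of such completeness constraints can be illustrated with Python's
standard sqlite3 module. The employee table and its contents are hypothetical;
the sketch only shows that a NOT NULL constraint acts on the records that are
actually entered.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE employee (name TEXT NOT NULL, salary REAL NOT NULL)")
con.execute("INSERT INTO employee VALUES ('Ann', 4200.0)")

# A record with a missing attribute value is rejected by the constraint ...
try:
    con.execute("INSERT INTO employee VALUES ('Bob', NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# ... but nothing forces the payroll office to enter Bob, or anyone else,
# at all: the database contains whatever has been stored and no more.
stored = con.execute("SELECT COUNT(*) FROM employee").fetchone()[0]
print(rejected, stored)
```

The constraint guarantees well-defined values for stored records, yet the
relation remains silent about employees who were never entered.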

Another prominent example often referred to as a data integrity problem is
that a database should always contain only up-to-date data. But again, there
exists neither a concept to formally specify this requirement nor one to
prevent the storing of outdated or expired data.*

Already these simple examples show that, in order to ensure reliable data,
we need new concepts and techniques that extend the traditional notion of
data integrity and correctness. We cannot always expect high quality data
or data that satisfy all integrity constraints. The new concepts must be rich
enough to deal with less than total integrity, as indicated in (Sheth 1997).

Although there does not exist a precise definition for data quality, the
following dimensions are frequently used to analyze data quality aspects† and
will also be considered later in this paper.

1. accuracy, describing how precisely and accurately real world information
is mapped into local data structures (e.g., exact versus approximate values).

2. completeness, meaning that all real world information relevant for
applications is recorded in the database.

3. timeliness, referring to the fact that the recorded information is
up-to-date and not expired.

Note that in order to prevent incorrect data, these dimensions cannot be
specified formally for a database, and they are even more difficult to
maintain. In a centralized database system, data not satisfying some of
the above dimensions is often detected only accidentally when, e.g., the
retrieved data is compared with other information.

* Unless one uses concepts that store the validity interval of data, as is
done in temporal databases.

† In (Wang and Strong 1996) a survey is presented that lists 179 data quality
attributes suggested by various data consumers.

Besides the above rather intuitive notions, for our approach we furthermore
add the dimension of consistency to data quality aspects. This aspect refers
to the traditional notion of data integrity typically used in centralized
database systems.

4. consistency, requiring that the data stored in a database satisfy the
integrity constraints specified for the database.

Data quality maintaining mechanisms exist only for the dimension of data
consistency. With regard to the other dimensions we have to cope with
incorrect and poor quality data unless additional concepts, such as those
described in the following sections, allow us to identify and manage these
data.
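
As a rough illustration of how the four dimensions might be recorded for a
component relation, consider the following sketch. The QualityDescriptor
class, its fields, and the sample values are assumptions made here for
illustration and do not represent the metadata model of this paper;
timeliness is approximated by a validity interval, in the spirit of the
temporal database footnote.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class QualityDescriptor:
    source: str        # component database the data comes from
    accuracy: str      # e.g. "exact" or "approximate"
    complete: bool     # all relevant real world information recorded?
    valid_from: date   # start of the validity interval
    valid_until: date  # end of the validity interval
    consistent: bool   # local integrity constraints satisfied?

    def timely(self, today):
        """Data count as timely while `today` lies in the validity interval."""
        return self.valid_from <= today <= self.valid_until

# Hypothetical descriptor for one local relation.
descriptor = QualityDescriptor(
    source="CDB1", accuracy="approximate", complete=False,
    valid_from=date(1998, 3, 1), valid_until=date(1998, 3, 7),
    consistent=True,
)

fresh = descriptor.timely(date(1998, 3, 5))
stale = descriptor.timely(date(1998, 4, 1))
print(fresh, stale)
```

Only the consistency flag corresponds to something a centralized database
system can enforce by itself; the remaining fields would have to be supplied
and maintained from outside the local database.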

3.2 Data Quality in Data Integration 

Recent reports on the usage of integrated data, in particular of data ware- 
houses and decision support systems, show that the quality of integrated data 
causes major problems regarding the reliability of the information obtained 
through database federations. The reason for this is that the quality and
integrity of the data often have not been an issue for single, centralized
database systems, because users and applications “know” their data and know how
to cope (manually) with inconsistencies or rather poor quality data. Thus data
having poor quality and integrity are often not detected until the usage of
the data changes. This is exactly the scenario that arises when data are
integrated from multiple database systems in order to fulfill the requirements 
of new, global applications. 

In this section we present a concept that allows us to detect and handle 
poor quality data during data integration. The rationale behind this concept 
is that while it is difficult to detect incorrect or poor quality data in centralized 
databases, during data integration the quality of data can be compared. Such 
comparisons are typically performed when semantic data conflicts among the
component databases containing semantically equivalent data are resolved. 
Existing approaches to data integration, as sketched in Section 2, handle such 
conflicts in a static manner. That is, a unique conflict resolution rule is given 
that specifies how data are integrated. However, as we will see below, for 
many such data conflicts there are no unique data integration rules because 
the correctness and quality of the data changes with time. 

Example 1 Suppose two relations at two different component databases
(CDBs) store environmental data. Both relations (named pollution)
contain data about the quantity of some toxic material recorded for different
regions. Assume that both relations have already been restructured and that
they are “ready” to be integrated into a global relation Pollution.

In order to define a data integration rule for these two relations, a
data conflict must obviously be resolved due to the different quantities
recorded for the same regions. For this, data integration approaches suggest a
conflict resolution function such that, e.g., in the global relation the
quantity for an area is the average of the two values recorded for this area at
CDB1 and CDB2. Now assume that for the above scenario we get further
information from the local DBAs stating the following: CDB1 is updated on
Mondays, Thursdays, and Saturdays, and CDB2 is updated on Tuesdays, Fridays,
and Sundays. Thus there is no reason to leave out one relation for integration
or to compute average data values.

In fact, both relations must be considered for integration, and the date 
a global query is issued determines from which relation to select up-to-date 
information. There is no unique integration rule ensuring the retrieval of up- 
to-date data because the data change with time. 
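The date-dependent choice described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions: the update schedules are taken from Example 1, whereas the function names and the rule "prefer the relation whose last scheduled update is most recent" (needed for days such as Wednesday, on which neither CDB is updated) are assumptions and not part of the approach described here.

```python
from datetime import date

# Update schedules reported by the local DBAs in Example 1.
# Weekday numbers follow date.weekday(): Monday = 0 ... Sunday = 6.
UPDATE_DAYS = {
    "CDB1": {0, 3, 5},  # Monday, Thursday, Saturday
    "CDB2": {1, 4, 6},  # Tuesday, Friday, Sunday
}


def days_since_update(cdb: str, query_date: date) -> int:
    """Days between query_date and the most recent scheduled update of cdb."""
    today = query_date.weekday()
    return min((today - day) % 7 for day in UPDATE_DAYS[cdb])


def most_up_to_date(query_date: date) -> list:
    """CDB(s) whose last scheduled update is closest to query_date.

    Assumption: on days without any update, the CDB updated most
    recently is considered the most up-to-date one.
    """
    ages = {cdb: days_since_update(cdb, query_date) for cdb in UPDATE_DAYS}
    best = min(ages.values())
    return sorted(cdb for cdb, age in ages.items() if age == best)
```

For a query issued on a Monday the sketch selects CDB1, for a query issued on a Tuesday it selects CDB2, so no single static rule covers both cases.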

Example 2 Suppose a third component database CDB3 with the following 
(restructured) relation: 

Pollution@CDB3
Region  Area  Quantity
R1      null  45
R2      null

In comparison to the relations from CDB1 and CDB2, the third relation
contains less accurate information. Therefore, data integration approaches
would not consider this relation for integration. We argue that this relation
should still be included, because if both CDB1 and CDB2 are not reachable
from the federation level, it is still possible to access information of
lower quality.

It should be mentioned that another reason for integrating all three relations
(in a suitable manner) is that users are often interested in only “rough”
information. If corresponding language constructs are provided within the
global query language, the relation Pollution@CDB3 would be sufficient to
satisfy such requests. More importantly, some queries then can be processed 
much more efficiently because this relation already contains some aggregate 
data. Thus the data volume can be regarded as another useful data quality 
aspect. 

The above examples all exhibit some kind of data conflicts, leading to the 
following observations: 

1. Although the quality and integrity of the data at each individual component 
database can be high, concerning the integrated, global view of data, quality 
and integrity can be poor. 

2. (Semantic) data conflicts cannot always be resolved statically by specifying
unique data integration rules at federation design time, because of the
dynamics of the data stored at component databases. In such cases static
rules would induce varying data quality and integrity at the integration
level.

3. Integrating poor quality data or data with less integrity (as in the second 
example) sometimes is better than integrating no data at all. 

4. Depending on the data requested by global applications, integrating poor 
quality data can decrease query processing cost. 

Based on the above observations, another important question is what data
integration rules should look like in the presence of data having (time-varying)
quality. If all data have the same quality, of course, traditional integration
rules can be adopted. For example, assuming that the first two relations
in the example above contain data of the same quality, a global relation
pollution@GLOBAL could simply be described by the data integration rule
pollution@GLOBAL := pollution@CDB1.

In case there are differences in the quality of data, roughly speaking, we
have to integrate data from all three relations, i.e., we have the general
integration rule

pollution@GLOBAL := pollution@CDB1 ∪ pollution@CDB2 ∪ pollution@CDB3.

This is an essential difference from the traditional approaches to data
integration, where integration rules are based on whether the extensions of the
relations to be integrated are disjoint or overlap. Furthermore, building the
union of all relations, as suggested in our approach, requires a different
approach for processing, more precisely decomposing, global queries that refer
to pollution@GLOBAL. We discuss this aspect in more detail in Section 4.

Before we describe how different data quality aspects can be modeled at the
integration level, we discuss “suitable” strategies for handling the integration
of integrity constraints that may exist at component databases. Despite detailed
proposals on how to handle integrity constraints in database integration (e.g.,
(Reddy et al. 1995, Vermeer and Apers 1996, Conrad et al. 1997)), there is
no consensus on an optimal strategy for designing or deriving global integrity
constraints. The two extremes range from not considering integrity constraints
at all (often suitable for global read-only applications) to the integration of
the most restrictive integrity constraints.

We argue that often none of the proposed approaches fulfills the practical need
to cope with less than absolute integrity. For example, not considering some
information from one component database because it does not satisfy a global
integrity constraint prevents the integration of perhaps “useful” information.
If, for instance, some tuples from a relation stored at a CDB do not
satisfy a simple global domain constraint, the tuples may still carry some
useful information and should be considered for integration. This, however,
requires that we suitably model this aspect at the integration level. The
explicit handling of integrity constraints in global query processing and
their usage for “data scrubbing” are discussed in Section 4.

3.3 Modeling Data Quality

In order to suitably model data quality and integrity aspects discovered dur- 
ing conflict resolution and data integration, additional constructs must be 
provided at the integration level, more precisely, within the metadata reposi- 
tory. These additional metadata record respective information and make this 
information accessible to global applications and to the query processor. Al- 
though it seems that data quality and integrity aspects are just some further 
attributes that can be associated with component databases and stored in- 
formation, it is difficult to simply assign, e.g., discrete values to sites and 
information such as “complete” , “less complete” , or “not complete” . 

As shown in the previous section, quality aspects can often only be detected 
by comparing information from two CDBs. In order to model these aspects, 
we define a set $Q := \{Q_1, \ldots, Q_6\}$ of basic quality dimensions:

$Q_1$ accuracy        $Q_4$ availability
$Q_2$ completeness    $Q_5$ reliability
$Q_3$ timeliness      $Q_6$ data volume

Note that with regard to the quality aspects listed in Section 3.2, we have
added the dimensions $Q_4$, $Q_5$, and $Q_6$, which are only relevant for
information accessed through a federation. We suggest that the dimension
reliability be used only if no other dimensions such as accuracy, completeness,
or timeliness can be applied. In this respect, reliability can be considered as
a generalization of $Q_1$, $Q_2$, and $Q_3$. Further dimensions, of course, are
possible and should be included depending on the properties and usage of
integrated data.

Comparisons of data quality can occur at different levels of granularity,
e.g., we can compare relations at different CDBs or CDBs as a whole. We
assume that data quality comparisons with regard to a quality dimension
$Q_i$ must occur on information units of the same granularity. The different
types of granularity we consider are component databases (C), relations (R),
(selected) tuples from a relation (determined through a selection $\sigma_F(R)$)
(T), and projections on attributes of (selected) tuples (A).

For the different types, we assume a function origin that can be applied to 
any information unit and returns the name of the CDB the unit is contained in. 
For example, origin(CDB) = CDB, or origin(R@CDB1) = CDB1. As we will
show later, this function can be useful to query the origin of poor quality data 
retrieved by a global query. Comparisons of data quality between semantically 
related information units having the same granularity are defined as follows: 

Definition 3 Let $gran(I)$ denote the granularity of an information unit $I$,
i.e., $gran(I) \in \{C, R, T, A\}$. For two semantically related information
units $I_k, I_l$ such that $gran(I_k) = gran(I_l)$, a data quality comparison
with regard to a quality dimension $Q_i \in Q$ is denoted by
$I_l \,\theta_{Q_i}\, I_k$, $\theta \in \{>, =\}$.

The meaning of $I_l \,\theta_{Q_i}\, I_k$, also called a quality statement, is
that the quality of the information represented by $I_l$ is higher than
($\theta$ is $>$) or equal to ($\theta$ is $=$) the quality of the information
represented by $I_k$ wrt the quality dimension $Q_i$.
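The notions of information unit, granularity, origin, and quality statement can be made concrete with a small Python sketch. It is a minimal illustration, not part of the original proposal: the class and field names are hypothetical, and the granularity is simply derived from which fields of a unit are set.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class InformationUnit:
    """An information unit; all field names are hypothetical."""
    cdb: str                                       # component database holding the unit
    relation: Optional[str] = None                 # None for a whole CDB
    selection: Optional[str] = None                # selection condition F, if any
    projection: Optional[Tuple[str, ...]] = None   # projected attributes, if any

    def gran(self) -> str:
        """Granularity: C (database), R (relation), T (tuples), A (attributes)."""
        if self.relation is None:
            return "C"
        if self.projection is not None:
            return "A"
        if self.selection is not None:
            return "T"
        return "R"


def origin(unit: InformationUnit) -> str:
    """Name of the CDB the unit is contained in."""
    return unit.cdb


@dataclass(frozen=True)
class QualityStatement:
    """'left theta right' for one quality dimension, theta in {'>', '='}."""
    dimension: str
    left: InformationUnit
    theta: str
    right: InformationUnit

    def __post_init__(self):
        if self.theta not in (">", "="):
            raise ValueError("theta must be '>' or '='")
        # Comparisons are only defined between units of the same granularity.
        if self.left.gran() != self.right.gran():
            raise ValueError("units must have the same granularity")
```

A statement such as QualityStatement("Q1", InformationUnit("CDB1", "pollution"), ">", InformationUnit("CDB2", "pollution")) then corresponds to a quality statement on the accuracy of two relations, while comparing a relation with a whole CDB is rejected.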

Data quality comparisons among semantically related information units,
say two relations $R_1$ and $R_2$ which share an attribute $A$, can be specified
formally. $R_1$ is said to be more up-to-date (at time point $t$) than $R_2$ iff
$R_1$ contains more tuples than $R_2$ that have recently been updated on $A$.

It is important to note that the knowledge necessary to determine qual- 
ity statements can often be derived from information profiles. Such profiles, 
which can be associated with information units of any granularity, describe 
the information processing techniques employed to map real world data into 
local data structures and the maintenance of these data. 

The coarsest information units to which quality comparisons can be applied are
CDBs; the finest granularity is determined by attributes of (selected) tuples
stored in a relation at a CDB. For tuples and attributes describing $I_l$ and
$I_k$, we require that the same selection condition $F$ and projection are used.

Since one cannot expect that quality aspects are known, and thus can be
compared, for all information units and all quality dimensions, we make the
following assumptions.

Assumptions 4

A1 For a quality statement $I_l \,\theta_{Q_i}\, I_k$ with
$gran(I_l) = gran(I_k)$, we assume that for all information units $I'_l, I'_k$
of finer granularity with $origin(I'_l) = origin(I_l)$ and
$origin(I'_k) = origin(I_k)$ the quality statement $I'_l \,\theta_{Q_i}\, I'_k$
holds.

A2 Exceptions to assumption A1 must be specified explicitly, i.e., for
$I_l \,\theta_{Q_i}\, I_k$ we can have a quality statement
$I'_l \,\theta'_{Q_i}\, I'_k$.

A3 For all semantically related information units from the different CDBs
having the same granularity, we assume that if no quality statement with
regard to $Q_i$ has been specified or induced by A1, and no exceptions are
given, then these information units all have the same quality wrt $Q_i$ (which
can be different from those specified explicitly with regard to $Q_i$).

It should be mentioned that there are many special cases for the different
quality attributes. For example, the quality dimension $Q_4$ (availability)
can only be associated with the coarsest information units, i.e., component
databases. Assumption A1 is useful in order to specify, e.g., that all data
stored in one component database are in general more complete than the
data of another CDB. This property is then automatically inherited by the
relations at these CDBs. For certain relations at these CDBs, however, we
allow exceptions which must be specified explicitly.
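How the three assumptions interact can be sketched in Python as a lookup that first consults explicit statements at the finest level (exceptions, A2), then statements inherited from coarser units (A1), and finally falls back to equal quality (A3). This is a simplified illustration: units are plain strings such as "R@CDB1" or "CDB1", only the two levels relation and CDB are modeled, and all names are assumptions.

```python
FLIP = {">": "<", "<": ">", "=": "="}


def parent(unit):
    """Coarser unit containing `unit`: 'R@CDB1' -> 'CDB1'; a CDB has no parent."""
    return unit.split("@", 1)[1] if "@" in unit else None


def lookup(statements, dim, a, b):
    """Explicit statement between a and b for dimension dim, or None."""
    if (dim, a, b) in statements:
        return statements[(dim, a, b)]
    if (dim, b, a) in statements:
        return FLIP[statements[(dim, b, a)]]
    return None


def compare(statements, dim, a, b):
    """Return '>', '<' or '=' for the quality of a relative to b wrt dim.

    statements maps (dimension, left, right) to '>' or '='.
    """
    while a is not None and b is not None:
        op = lookup(statements, dim, a, b)
        if op is not None:
            return op          # explicit statement, or exception (A2)
        a, b = parent(a), parent(b)   # inherit from the coarser level (A1)
    return "="                 # nothing specified: same quality (A3)


# Hypothetical statements: CDB1 is in general more complete than CDB2,
# with an explicit exception for relation S.
STATEMENTS = {
    ("Q2", "CDB1", "CDB2"): ">",
    ("Q2", "S@CDB1", "S@CDB2"): "=",
}
```

With these statements, compare(STATEMENTS, "Q2", "R@CDB1", "R@CDB2") yields ">" by inheritance, whereas the same comparison for S yields "=" because of the explicit exception.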

A set S of quality statements over a set C of component databases and asso- 
ciated information units of finer granularity has to satisfy certain properties. 

Definition 5 A set S of quality statements is correct iff there exists no
contradiction between explicitly specified quality statements, and S is complete
iff every information unit is considered in an explicit or implicit quality
statement.
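A possible check for the correctness property can be sketched in Python. The sketch assumes that quality statements are transitive, so that a contradiction shows up as a cycle of ">" statements once units stated to be equal are merged; this reading of "contradiction", as well as all names, is an assumption and not taken from the definition itself.

```python
def is_correct(statements):
    """Check a set of statements for one quality dimension for contradictions.

    statements: iterable of (left, op, right) with op in {'>', '='}.
    """
    statements = list(statements)
    rep = {}

    def find(x):
        # Union-find lookup with path halving.
        rep.setdefault(x, x)
        while rep[x] != x:
            rep[x] = rep[rep[x]]
            x = rep[x]
        return x

    # Merge units that are stated to have equal quality.
    for left, op, right in statements:
        find(left)
        find(right)
        if op == "=":
            rep[find(left)] = find(right)

    # Build the "higher quality than" graph between the merged classes.
    edges = {}
    for left, op, right in statements:
        if op == ">":
            edges.setdefault(find(left), set()).add(find(right))

    # Depth-first search; a cycle (including a self-loop) is a contradiction.
    state = {}

    def has_cycle(node):
        state[node] = "open"
        for nxt in edges.get(node, ()):
            if state.get(nxt) == "open":
                return True
            if nxt not in state and has_cycle(nxt):
                return True
        state[node] = "done"
        return False

    return not any(n not in state and has_cycle(n) for n in list(edges))
```

For instance, the statements A > B and B = C are accepted, while A > B together with B > A, or A > B together with A = B, are rejected.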

Although assumption A3 ensures the completeness of quality statements,
it does not state how to handle information units that all have the same
quality. Assume that we have five component databases, and each CDB contains
a relation P to be integrated into a global relation P. Furthermore suppose
that the only quality statements we have are P@CDB1 $>_{Q_2}$ P@CDB2 and
P@CDB1 $>_{Q_2}$ P@CDB3. Due to A3, we have that P@CDB2 $=_{Q_2}$ P@CDB3, but we
also have that P@CDB4 $=_{Q_2}$ P@CDB5. But how are these two relations related
to the relations at CDB1, CDB2, and CDB3 with regard to $Q_2$? One could
require that the designer of the federation has to solve such cases by
connecting P@CDB4 and P@CDB5 with at most one of the other three relations by
quality statements. A more appropriate approach, however, would be to consider
P@CDB4 and P@CDB5 as a separate class that must be handled appropriately
by the query processor (see below).

As discussed earlier and shown in Example 1, the quality of data may change
as time advances. The above quality statements, however, only describe static
properties of data quality. For this reason, we allow a temporal condition TC
to be associated with each quality statement. A temporal condition
essentially describes a validity interval that specifies when a quality
statement holds. For the sake of simplicity, we assume that a validity interval
can be described either by a start and end date, e.g., [01-01-96, 12-31-97],
or by a set of explicit dates. If no temporal condition is specified, we assume
that the quality statement always holds. More useful and complex specifications
for validity intervals, of course, should be investigated. In this context it
might even be useful to associate a time interval with a single information
unit, leading to a concept similar to that found in temporal databases.
Including temporal conditions also requires extending the notion of a complete
and correct set of quality statements which, however, is trivial and will not
be discussed here.

Example 6 Assume the three relations from the CDBs discussed in Example
2. The specification of the quality statements could look as follows (with
R = Pollution):

$Q_1$: R@CDB1 $=_{Q_1}$ R@CDB2 $>_{Q_1}$ R@CDB3 (Accuracy)
$Q_2$: R@CDB1 $=_{Q_2}$ R@CDB2 $>_{Q_2}$ R@CDB3 (Completeness)
$Q_3$: R@CDB1 $>_{Q_3}$ R@CDB2 (Timeliness)
       with TC = sysdate.day $\in$ {'monday', 'thursday', 'saturday'}
       R@CDB2 $>_{Q_3}$ R@CDB1 (Timeliness)
       with TC = sysdate.day $\in$ {'tuesday', 'friday', 'sunday'}
$Q_4$: CDB1 $=_{Q_4}$ CDB2 $>_{Q_4}$ CDB3 (Availability)
$Q_6$: R@CDB1 $=_{Q_6}$ R@CDB2 $>_{Q_6}$ R@CDB3 (Data volume)
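One way such statements and their temporal conditions might be represented is sketched below in Python. The sketch is illustrative only: a temporal condition is modeled as a predicate over the query date, built either from a set of weekday names (as in the timeliness statements above) or from a start and end date; the class and helper names are assumptions.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional

DAY_NAMES = ["monday", "tuesday", "wednesday", "thursday",
             "friday", "saturday", "sunday"]


def weekdays(*names):
    """Temporal condition that holds on the given weekdays."""
    wanted = {name.lower() for name in names}
    return lambda d: DAY_NAMES[d.weekday()] in wanted


def interval(start: date, end: date):
    """Temporal condition that holds within the closed interval [start, end]."""
    return lambda d: start <= d <= end


@dataclass(frozen=True)
class TimedStatement:
    dimension: str
    left: str
    theta: str                                   # '>' or '='
    right: str
    tc: Optional[Callable[[date], bool]] = None  # None: statement always holds

    def holds(self, on: date) -> bool:
        return self.tc is None or self.tc(on)


# The timeliness statements of the example, plus one untimed statement.
STATEMENTS = [
    TimedStatement("Q3", "R@CDB1", ">", "R@CDB2",
                   weekdays("monday", "thursday", "saturday")),
    TimedStatement("Q3", "R@CDB2", ">", "R@CDB1",
                   weekdays("tuesday", "friday", "sunday")),
    TimedStatement("Q1", "R@CDB1", "=", "R@CDB2"),
]


def valid_on(statements, on: date):
    """Statements whose temporal condition is satisfied on the given date."""
    return [s for s in statements if s.holds(on)]
```

Evaluating valid_on(STATEMENTS, ...) for a Monday keeps only the first timeliness statement, for a Tuesday only the second, and the untimed accuracy statement is kept on every date.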

Although we have not given a complete formal specification of the language 
that can be used to formulate quality statements and temporal conditions, 
it should be obvious that a complete formalism can easily be developed and 
that respective specifications can be represented in the metadata repository. 
Interestingly, correctness and completeness of quality statements then can be 
considered as integrity constraints imposed on respective metadata. 

The final step in modeling data quality aspects consists of handling local 
integrity constraints, which are specified in the global data model. For this, 
we take a rather pragmatic approach which, we think, is most appropriate if 
global applications only access integrated data but do not modify these data. 
We assume that for each component database $CDB_i$ a set of local integrity
constraints is specified. In practice, these constraints turn out to be quite
simple, mainly restricted to domain and foreign key constraints. The definition
of each constraint and the names of the global relations the constraint affects
are recorded in the metadata repository. Information about integrity constraints
that cannot be associated with a component database but which have been
derived in combination with conflict resolving rules is recorded as well.

The discussions in this section show that the metadata repository plays an 
important role during the database integration task. All information related to 
data quality and data integrity aspects is recorded as metadata in addition
to conflict resolving and data integration rules. 



4 QUERY PROCESSING 

In the federated database approach, the global schema is employed to submit 
a global query. Objects of the global schema are transparent to the user, i.e., 
the user is unaware of the component databases and the data integration rules 
encoded in the metadata. The metadata are used by the global query processor,
which decomposes the global query into global subqueries such that the
data needed by each subquery are available from one CDB. The global query
processor also translates each global subquery into queries of the corresponding
CDB and finally combines the results returned by the subqueries. For
a detailed discussion of query processing in federated and multidatabase
systems, we refer the reader to, e.g., (Meng and Yu 1995) or (Evrendilek et
al. 1997).

In the presence of different data quality aspects, which are encoded in the
metadata, it is quite obvious that global queries cannot simply be decomposed
solely based on the prespecified unique data integration rules. Assume, for
example, the global relation Pollution[@GLOBAL] defined by the integration
rule pollution := pollution@CDB1 ∪ pollution@CDB2 ∪ pollution@CDB3
(see also Example 2) and that we want to issue the query

select Region, sum(Quantity) from Pollution[@GLOBAL]
group by Region

Using only the above integration rule we would get an erroneous result. But 
what do we expect? The general strategy should be that we always want high 
quality data. That is, data integration rules should exclude (time-varying)
poor quality data. However, provided that we have suitable language con- 
structs in our global query language, one could also be interested in all data 
which include data of different quality. In this case we get a multiresolution 
query. For the above example, we get three sets of tuples, and each set must 
suitably be represented to reflect differences in the quality of the result (in 
the above case outdated and less accurate data must be indicated) . 

Given a set of quality statements including temporal conditions, the prob- 
lem of global query processing can be reduced to the following issues: 

1. Tuples and attributes having different quality cannot simply be combined 
using the union operator, and 

2. tuples having different quality cannot be joined, e.g., it must not be possible 
to join outdated or expired data (tuples) with up-to-date data. 



In this paper we suggest a preliminary concept of a conservative approach to 
these problems. For this, we assume that the main usage of the integrated data 
is not based on ad-hoc queries, but applications having well-defined quality 
requirements with regard to the data to be retrieved. We also assume that typ- 
ical global queries are simple and do not contain complex subqueries. Different 
global applications may have different requirements concerning the quality of 
integrated data used in the applications. Rather than encoding these
requirements explicitly in numerous data integration rules tailored to certain
applications, global queries on global relations can be enriched by data
quality conditions. These conditions are specified by the designer who employs
the quality statements encoded in the metadata repository in a semi-transparent
fashion. For this, we suggest an extension of the global query language, say
SQL, by quality predicates. These predicates specify the data quality goal of
a global query as shown in the general pattern

select <list of attributes> 
from <list R1, ..., Rn of global relations>
where <SQL selection condition> 
with goal <list of data quality goals>; 

The first goal in the list of data quality goals designates the primary data 
quality goal, the next goal the secondary data quality goal, and so on. A simple
goal is either most up-to-date, most accurate, most complete, or most reliable. 

Consider, for example, the primary goal most up-to-date, which refers to the
timeliness of the integrated data. In this case, for each global relation
$R_i$, the query processor reduces the corresponding general data integration
rule to an integration rule that considers only the most up-to-date local
relations that build up $R_i$ at the time point $t$ at which the query is
issued. Information about the most up-to-date local relations can easily be
determined by using the quality statements and associated temporal conditions.
If several local relations satisfy the specified goal, only those relations are
chosen that satisfy the secondary goal best. If we allow temporal conditions
for quality statements, only those information units (relations) are considered
in the integration rule that satisfy these conditions. Thus data quality
specific data integration rules are generated dynamically from general
integration rules that do not consider data quality aspects.

Example 7 Assume a global query referring to global relations
$R$ ($= R_1 \cup R_2$) and $S$ ($= S_1 \cup S_2 \cup S_3$) and that we have the
quality statements $R_1 >_{Q_1} R_2$ and $S_1 >_{Q_1} S_2 =_{Q_1} S_3$ referring
to the accuracy of the integrated data. With the data quality goal most
accurate the query processor reduces the two general data integration rules to
$R = R_1$ and $S = S_1$. In case there are no quality statements about the
accuracy of the local relations $R_1, R_2$, the original data integration rules
are chosen for $R$ and $S$ (which include duplicate elimination).
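The reduction of a general integration rule by a list of data quality goals can be sketched in Python as follows. The sketch is a simplified illustration under assumptions: an integration rule is just the list of local relations forming the union, statements are (left, op, right) triples grouped by dimension, relations stated to be of lower quality than another relation (directly or via an equality) are dropped, and the mapping of goal names to dimensions as well as all function names are hypothetical.

```python
GOAL_DIMENSION = {
    "most accurate": "Q1",
    "most complete": "Q2",
    "most up-to-date": "Q3",
    "most reliable": "Q5",
}


def reduce_rule(sources, statements):
    """Keep the local relations not stated to be of lower quality.

    sources: local relations forming the union of the general rule.
    statements: (left, op, right) triples over these sources, op in {'>', '='}.
    """
    # Group relations that are stated to be of equal quality.
    group = {s: {s} for s in sources}
    for left, op, right in statements:
        if op == "=":
            merged = group[left] | group[right]
            for s in merged:
                group[s] = merged
    # Everything equal to a dominated relation is dominated as well.
    dominated = set()
    for left, op, right in statements:
        if op == ">":
            dominated |= group[right]
    return [s for s in sources if s not in dominated]


def apply_goals(sources, goals, statements_by_dim):
    """Apply the primary goal first, then later goals while ties remain."""
    remaining = list(sources)
    for goal in goals:
        if len(remaining) <= 1:
            break
        dim = GOAL_DIMENSION[goal]
        relevant = [(l, op, r) for l, op, r in statements_by_dim.get(dim, [])
                    if l in remaining and r in remaining]
        remaining = reduce_rule(remaining, relevant)
    return remaining


# Quality statements of Example 7 (accuracy, Q1).
ACCURACY = {"Q1": [("R1", ">", "R2"), ("S1", ">", "S2"), ("S2", "=", "S3")]}
```

With these statements, apply_goals(["S1", "S2", "S3"], ["most accurate"], ACCURACY) keeps only S1, and a rule for which no statements exist is returned unchanged. Relations that are not connected to the others by any statement (such as P@CDB4 and P@CDB5 in the earlier discussion) are kept by this sketch and would still need separate handling.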

Query language constructs to retrieve data not satisfying the specified query 
goal(s) must be provided as well in order to compare data of different quality 
and thus to determine poor quality data (and its origin). Assume a global 
relation S as specified above, but with the quality statements referring to 
the completeness. The information that this relation consists of two groups 
of data having different quality can easily be represented to the designer. 
The global query language furthermore must provide the designer language 
constructs to retrieve all data or only data that belong to one of these two 
groups. Provided that respective language constructs exist, the designer can 
retrieve groups separately, and he can specify select statements that retrieve 
tuples which occur in all or only specified groups. The basic idea behind this 
concept is that the designer has the possibility to compare quality aspects of 
integrated data at the federation level. The information obtained about these
aspects then leads to the formulation of global queries that are used by global
applications.

Before we conclude this paper in the next section, we finally discuss how in- 
tegrity constraints should be considered at the integration level and for global 
applications. We suggest employing local integrity constraints (formulated in
the global data model) to filter tuples retrieved by global queries. Assuming 
that the constraints specified at CDBs are simple, restricted, e.g., to domain 
and foreign key constraints, sophisticated techniques for data scrubbing can 
be developed. 

Assume, for example, the designer wants to retrieve data from a global re- 
lation R (for which quality statements exist). Provided that there are suitable 
tools at the federation layer, information about the existence of integrity con- 
straints affecting this relation can be displayed to the designer. Recall that the 
information about integrity constraints is recorded in the metadata and thus 
needs only be represented suitably. The designer then can select some of these 
constraints and can impose these constraints on the original query. The query
result would still be the same, but result tuples violating the selected
constraints are highlighted by, e.g., a certain coloring scheme. The query result
then can be reduced to those tuples from R that violate selected constraints 
and the designer then can investigate the origin of these tuples. For this, the 
query processor can provide information about the integration of data from 
local relations that build up the global relation R. Knowing the origin of tuples 
that violate integrity constraints can lead to improvements of data quality and 
integrity at local component databases. 
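The filtering step described above can be illustrated with a small Python sketch. It is only an illustration: the result tuples, the "origin" field, and the single domain constraint are made-up examples, and a real federation layer would obtain the constraints from the metadata repository.

```python
def flag_violations(rows, constraints):
    """Pair every result tuple with the names of the constraints it violates.

    rows: list of dicts, each carrying an 'origin' entry (the source CDB).
    constraints: dict mapping a constraint name to a predicate over a row.
    The result set itself is unchanged; violations are only marked.
    """
    return [
        (row, [name for name, satisfied in constraints.items()
               if not satisfied(row)])
        for row in rows
    ]


def violating_origins(flagged):
    """Origins (CDB names) of all tuples violating at least one constraint."""
    return sorted({row["origin"] for row, violated in flagged if violated})


# Hypothetical result of a global query on Pollution, with tuple origins.
ROWS = [
    {"Region": "R1", "Quantity": 45, "origin": "CDB3"},
    {"Region": "R2", "Quantity": -3, "origin": "CDB2"},
    {"Region": "R3", "Quantity": 12, "origin": "CDB1"},
]

# Hypothetical domain constraint selected by the designer.
CONSTRAINTS = {
    "quantity_non_negative":
        lambda row: row["Quantity"] is not None and row["Quantity"] >= 0,
}
```

Here flag_violations(ROWS, CONSTRAINTS) marks only the second tuple, and violating_origins reports CDB2 as the component database at which the data should be inspected.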

The above examples and discussions show that many sophisticated global
query processing concepts are possible, of which we have sketched only a
few. The development of a complete query language with precise semantics,
as well as associated efficient query processing techniques, is subject to
future research. However, the main idea of data quality based query processing
as outlined above is quite obvious: data quality aspects and their consideration
in query formulation and query processing are semi-transparent to the
designer. That is, the designer is not responsible for reducing general data 
integration rules to certain unions of local relations. However, the designer 
is aware of the fact that a global relation contains data of different quality. 
He can use this information (provided by the query interface) to build global 
views that retrieve only data satisfying certain quality goals. These views then 
can finally be associated with different global applications. For these applica- 
tions, the existence of data having different quality then is totally transparent. 
Depending on the type of applications, however, it is still possible to represent 
different data quality and integrity aspects in a suitable fashion.

5 CONCLUSIONS AND FUTURE WORK

In this paper, we have outlined a framework for modeling and managing data 
quality and integrity aspects in database integration. In particular, we have
argued that existing approaches to data integration typically fail to address
these issues, because of the assumption that conflict resolution and data
integration rules are of a static nature. As shown by means of several examples,
however, aspects such as outdated or incomplete data are often of a rather
dynamic nature. That is why we have introduced a taxonomy of data quality
and integrity aspects that can be used to specify (time-varying) statements
about data quality among component databases and relations at federation
design time. The respective information, which is stored in the federation’s 
metadata repository, then can be (1) employed by the designer to specify data 
quality goals for global queries and (2) used by the query processor to dynam- 
ically build data integration rules from general ones that are tailored to the 
specified data quality goals. 

In our future research we want to address the following issues: 



• A complete formal specification of data quality statements and an (in- 
cremental) modeling approach. The latter issue is important because we 
cannot expect a designer to detect and specify all data quality aspects at 
federation design time. It seems more appropriate to investigate deficiencies 
in the data quality reported by global users and to record corresponding 
information in the metadata repository. This would lead to an incremental 
data quality improvement strategy. 

• An extension of a global query language, such as SQL, which allows users 
and designers not only to formulate different query goals, but also to rep- 
resent retrieved data having different quality in a suitable manner. That is, 
representation techniques for multiresolution queries need to be developed. 
The extension of a global query language includes a well-defined syntax and 
semantics of data quality and data integrity related language constructs. 

• Efficient query processing techniques for tailoring general data integration 
rules to rules which integrate only data satisfying data quality goal(s) that 
are specified in the global query. 



In conclusion, we are convinced that the role of data quality in database in- 
tegration (in addition to data integrity) needs much more attention in order to 
ensure correct and meaningful integrated data. The “traditional” assumption 
that data conflicts such as outdated data, incorrect data, etc. among multiple
information sources can always be resolved by unique data integration rules
largely fails. The main reason for this is that in today's database applications
data typically change with time and thus inconsistencies and data quality
properties are of a dynamic nature.

Abstract.

This paper presents a model to specify integrity policies for database
management systems. This model makes it possible to (1) assign an integrity
level to each user (this integrity level depends on the data this agent is
authorized to update), (2) define updating permissions and prohibitions
associated with each user (in particular we show that permission and
prohibition to update may be independent from the user’s integrity level),
(3) define a policy to manage how integrity evolves in time.
Our model is compared with classical approaches, such as Biba and Clark-Wilson.
In particular, we do not follow Biba: in our model, a subject may be authorized
to update data even if its integrity level is not higher than or equal to the
integrity level of the data.

1. Introduction

With confidentiality and availability, integrity is one of the three well- 
known objectives of security which must be taken into account in 
information systems such as database management systems. Neverthe- 
less few works have addressed this property, compared to the property 
of confidentiality. One famous integrity model is Biba’s one (Biba, 
1976). This early model is directly derived from the Bell and LaPadula 
model (Bell and LaPadula, 1975): every object and subject is associated 
with an integrity level. The set of integrity levels is associated with a 
partial order relation. Biba’s model can then be summerized by the 
two following constraints: (1) a subject is allowed to write an object 
only if the integrity level of this subject is higher than or equal to the 
integrity level of this object, (2) a subject is allowed to read an object 
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only if the integrity level of this subject is lower than or equal to the 
integrity level of this object. 
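
As an illustration only, the two constraints can be written as two small predicates. This is a minimal sketch, assuming three totally ordered levels named LI, MI and HI (Biba's model allows any partial order); the function names are hypothetical and do not come from Biba's paper.

```python
from enum import IntEnum


class Level(IntEnum):
    """Integrity levels, assumed totally ordered for this sketch."""
    LI = 1  # low integrity
    MI = 2  # medium integrity
    HI = 3  # high integrity


def biba_may_write(subject: Level, obj: Level) -> bool:
    """Constraint (1): write only if the subject's level is >= the object's."""
    return subject >= obj


def biba_may_read(subject: Level, obj: Level) -> bool:
    """Constraint (2): read only if the subject's level is <= the object's."""
    return subject <= obj
```

Under these rules a low-integrity subject can never overwrite high-integrity data: `biba_may_write(Level.LI, Level.HI)` evaluates to `False`.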

In the context of databases, Biba’s model does not always fit with 
real requirements. In particular, a subject may need to update data 
even if the subject’s integrity level is not higher than or equal to the 
integrity level of the data; moreover, in many applications, its updating 
rights may not directly depend on its integrity level. 

In this paper, we first claim that there is a need for a temporal 
database to overcome these difficulties. Our objective is then to suggest 
a formal model of an integrity policy to manage updating in a temporal 
database. For any given database, we want our model to be able to: 

— assign an integrity level to each user, depending on the data it is 
authorized to modify/update, 

— define updating permissions and prohibitions for each user, 

— define a policy to manage how integrity evolves in time. 

The remainder of this paper is organized as follows. Section 2 will 
introduce our main motivations for this work. We show with two
simple examples that Biba’s model is clearly insufficient in concrete 
applications, and that there is a real need for temporal databases if 
one wants to manage integrity of data. From these examples, we will 
derive the main ideas which support our model. Section 3 explains 
our approach and justifies the different steps we use to achieve our 
objectives. In the fourth section the concept of multi-reliable database is 
defined to address the property of integrity. Roughly speaking, this con- 
cept may be viewed as similar to the well-known concept of multilevel 
database which has been suggested to manage a multilevel confiden- 
tiality policy. Section 5 presents a logical formalization for the notion 
of reliability of an agent which has to update data in a multi-reliable 
database; for a given agent, reliability depends on the nature of the 
data which must be updated. Moreover, the integrity of a given data is 
not static but may evolve in time. For this purpose, section 6 introduces 
a predicate called revise_integrity in order to formalize revision of data
integrity in time. Section 7 then shows how we can model an integrity 
policy which regulates data updates. In section 8, this model is applied 
to formalize the two initial examples of section 2. Finally section 9 
concludes this paper by comparing this work with earlier work,
and investigating several issues. 




N#     POS      Validity          Integrity
sh2    pos1     [12:00, 12:01[    HI
sh2    pos2     [12:01, 12:02[    LI
sh2    pos3     [12:02, 12:03[    LI
...    ...      ...               ...
sh2    pos60    [12:59, 13:00[    LI
sh2    pos61    [13:00, 13:01[    HI
sh2    pos62    [13:01, now]      LI

Figure 1. Instance of the relation Position

2. Motivations 

This section presents two examples to show some insufficiency in Biba’s 
model, and some desirable properties for the concept of integrity. 

2.1. Example 1 

Let us consider two ships sh1 and sh2, with sh1 continuously trying
to know sh2's position. sh1 receives information from two sources: a
satellite s, which provides highly reliable data every hour, and a
less reliable radar r on board sh1, which provides information every
minute. In an emergency, the estimate of sh2's position must be
sufficiently precise, so it cannot be based only upon the data coming
from the satellite each hour; therefore, sh1 must take the data
provided by both s and r into account.

Using a temporal database, we represent the position of sh2 with a 
relation Position: its attributes respectively represent the identifier of 
a ship, the value of its position, the time during which a given position 
is supposed to be valid, and the integrity level of this position. Figure 
1 gives a set of instances of this relation. 

We can notice that: 

- Here, as in most temporal databases, sh2's position is assumed not
to change as long as this position is not updated (each minute, in
our example); it is a default assumption. For this purpose, a special
value now is used, with the meaning “true until changed”.

- Contrary to Biba's model, data provided by r updates data provided
by s, although r's integrity level LI (Low Integrity) is lower than
the integrity level HI (High Integrity) of the data s provides. As a
matter of fact, in this application, it would not be sufficient to
deal only with data provided by s. This example shows that in
some applications, there is a real need for authorizing updates even
though the integrity level of the subject performing the update is
lower than the integrity level of the updated object.

— We need to keep track of high level data coming from s. Such 
records may be useful to compute trajectory of sh2 or to check 
for consistency of data coming from s and r. This goal may be 
accomplished by using a temporal database. 

— When data are provided by s and r at the same time, the database 
only keeps data coming from s because s is more reliable. In this 
case, data coming from s has priority over data coming from r. 

- Now let us assume that the satellite s is the only source of
information providing sh1 with the position of sh2. Most classical
temporal databases would consider that the position of sh2 remains
unchanged until s provides new data (an event which only occurs each
hour). However, since sh2 moves in time, it would not be safe to
consider that the position provided by s remains highly reliable
during the whole hour. Figure 2 provides a more realistic
representation of the integrity of data provided by s.

In this figure, the integrity level of the fact “sh2's position is pos”
becomes lower and lower in time, with a degradation speed depending
on some parameters such as, for instance, sh2's supposed speed.
Of course the position is highly reliable at the beginning of each
hour, that is, when the position is updated by the satellite. Then,
the integrity level of sh2's position changes in time. In figure 2, it
is assumed that this integrity decreases to medium (MI) after one
minute, and then to low (LI) after three minutes.

Figure 2. Decrease of the integrity of the position in time



2.2. Example 2 

Let us assume we are in a company where a secretary s1 has to manage
and update employees' salaries. s1 has a high integrity level for this
task. Another secretary s2 has to manage employees' vacation. s2 has
a high integrity level for this task.

Now, let us assume that during s1's vacation, a temporary secretary
s3 is hired and is authorized to update the salary data file, with a
medium level of integrity. On the other hand, s3 also has to replace
s2 during s2's vacation, so that s3 is also authorized to update the
vacation data file, with a high integrity level.

In this example, we can see that: 



- The integrity level of an agent considered as a source of information
may depend on the nature of the data this agent updates.

- The update rights of an agent do not depend on this agent's
integrity level: although s2 could be very reliable at updating the
salary data file, it is not authorized to do so; conversely, even
though s3's integrity level is medium, it is allowed to update this
file during s1's vacation, and is not allowed to do so at any other
time.

- The agent's rights for a task may depend on time: s3 is allowed to
update the salary file during s1's vacation only.

2.3. Requirements supporting our approach
From the two above examples, we claim that: 

1. The agent’s reliability may depend on the data it updates. 

2. The agent’s reliability may change in time. 

3. The agent’s updating rights may not depend on its reliability. 

4. Sometimes, an agent must be able to update a data even if its 
reliability level is lower than the reliability level of the data which 
is updated (despite Biba’s model). 

5. The agent’s updating rights may change in time. 

6. It is necessary to introduce a temporal representation of data to 
prevent a data with a low integrity level from erasing a data with 
a higher integrity level. 

7. Without any update, a data integrity level may decrease in time, 
according to some characteristics of a given application. 



3. Main steps of our approach 

We first define the concept of multi-reliable database. It is similar to 
the concept of multilevel database to deal with the property of con- 
fidentiality: data have labels which denote their level of integrity. We 
shall consider that a multi-reliable database is a set of beliefs and we
shall define what it means for a multi-reliable database to believe
that a given piece of data is associated with a given level of integrity.

As for confidentiality, we also assign an integrity (or reliability) level 
to each agent which is responsible for updating the database, i.e. which 
can be considered as a source of data: in our model, the integrity level of 
an agent depends on the data it updates. We also use temporal notions 
to model the fact that the integrity level of an agent may change in 
time. 

We also have to express that the integrity level of data may change in
time. For this purpose, we shall introduce an event revise_integrity: the
occurrence of this event automatically updates the level of integrity of a
given piece of data. Then it becomes possible to define, according to
insertions, deletions or revisions of data, when a multi-reliable database
believes that a given piece of data has a given level of integrity.

Then we suggest a formalization for the concept of integrity policy. 
We claim that an integrity policy has three components: 

1. The first one is the assignment of reliability levels to agents, depend- 
ing on the data they may update. 

2. The second one is a regulation of rights and prohibitions of agents 
towards insertions and deletions in the database. 

3. The last one is a set of rules specifying how the integrity level of 
data has to be revised. 

In this paper, we do not deal with integrity constraints, i.e. rules
which must be satisfied by any state of the database. This does not
mean that we consider that integrity constraints are not included in the
concept of integrity. In particular, we believe there is a clear connection
between integrity constraints and the notion of well-formed transaction
suggested in the Clark-Wilson model (Clark and Wilson, 1987; Clark
and Wilson, 1989). Therefore, although we do not take them into account
in our model for an integrity policy, such rules have to be specified and
enforced by the multi-reliable database.



4. Formalization of a multi-reliable database 

Let $IL$ be a set of integrity levels associated with a partial order
relation denoted $\leq$ (that gives a lattice structure). If $n_1$ and
$n_2$ are two integrity levels of $IL$, $n_1 \leq n_2$ means that $n_1$ is
lower than or equal to $n_2$. For instance, if HI (high integrity), MI
(medium integrity) and LI (low integrity) are in $IL$, then we have
$LI \leq MI \leq HI$. We also introduce a unary predicate Level, with
formulas of the form $\mathit{Level}(n)$ to be read “n is an integrity
level”.

We denote by $\triangleleft$ the strict order relation derived from
$\leq$; we have the following axiom for $\triangleleft$:

\[
\forall n_1, \forall n_2,\; n_1 \triangleleft n_2 \leftrightarrow n_1 \leq n_2 \wedge \neg(n_1 = n_2)
\]
Now we consider that a multi-reliable database is composed of several
single-level databases; each of them is assigned an integrity level $n$
and is the set of all the data which are explicitly assigned an integrity
level higher than or equal to $n$. We also consider that each single-level
database represents a set of beliefs. Therefore, to represent the content
of each single-level database, we introduce a modality believe, with
formulas of the form $\mathit{believe}(n, p)$ to be read: “the database
with integrity level n believes that the information p is valid”.

The axiomatics associated with $\mathit{believe}(n, p)$ is that of a KD
logic (Chellas, 1988):

- K: $(\mathit{believe}(n, p) \wedge \mathit{believe}(n, p \rightarrow q)) \rightarrow \mathit{believe}(n, q)$

(closure of belief within the database of integrity level n)

- D: $\mathit{believe}(n, p) \rightarrow \neg\mathit{believe}(n, \neg p)$

(consistency of the database associated with the integrity level n)

- N: If $p$ is a theorem then $\forall n \in IL$, $\mathit{believe}(n, p)$
is also a theorem.
(the database associated with level n believes all theorems)

We also have the following additional axiom:

- $\forall n, \forall n',\; (\mathit{believe}(n, p) \wedge n' \leq n) \rightarrow \mathit{believe}(n', p)$

(if the database of level n believes p then all databases of level n'
lower than n also believe p)

Then we assign an integrity label to every piece of information derived
from the multi-reliable database. For this purpose, we introduce a
meta-predicate safety_level, where $\mathit{safety\_level}(p, n)$ is to be
read: “the integrity level of information p is n”. The axiom associated
with $\mathit{safety\_level}(p, n)$ is:

- $\forall n \in IL,\; \mathit{safety\_level}(p, n) \leftrightarrow (\mathit{believe}(n, p) \wedge \forall n', (n \triangleleft n' \rightarrow \neg\mathit{believe}(n', p)))$

(the integrity level of p is n iff the database of level n believes p
and all databases with level n' higher than n do not believe p).

Finally, we check that the database at level n does not believe p by
testing the absence of an explicit belief in p, that is:

- If $\not\vdash \mathit{believe}(n, p)$ then $\vdash \neg\mathit{believe}(n, p)$
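
The definitions above can be illustrated with a small sketch. It is not the authors' implementation: it assumes three totally ordered levels, treats facts as opaque values, and ignores the logical closure required by axioms K and N. The class and method names are hypothetical. It shows the downward closure of belief, the closed-world reading of non-belief, and safety_level as the highest believing level.

```python
from enum import IntEnum
from typing import Dict, Hashable, Optional


class Level(IntEnum):
    LI = 1
    MI = 2
    HI = 3


class MultiReliableDB:
    """Toy multi-reliable database (assumption: totally ordered levels)."""

    def __init__(self) -> None:
        # fact -> highest level at which the fact is explicitly stored
        self._stored: Dict[Hashable, Level] = {}

    def store(self, fact: Hashable, level: Level) -> None:
        current = self._stored.get(fact)
        if current is None or level > current:
            self._stored[fact] = level

    def believe(self, level: Level, fact: Hashable) -> bool:
        # Downward closure: a fact held at level n is believed at every
        # level n' <= n. Closed world: anything else is not believed.
        stored = self._stored.get(fact)
        return stored is not None and level <= stored

    def safety_level(self, fact: Hashable) -> Optional[Level]:
        # Highest level that believes the fact, or None if no level does.
        believed = [n for n in Level if self.believe(n, fact)]
        return max(believed) if believed else None
```

For a fact stored at MI, the databases of levels LI and MI believe it, the database of level HI does not, and its safety level is MI.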

In the remainder of this paper, we shall actually consider that our
multi-reliable database contains temporal data. For this purpose and
following Sripada (Sripada, 1993), we shall introduce two additional
modalities hold_at and hold, with formulas of the form
$\mathit{hold\_at}(p, T)$ to be read “p is valid at time T” and formulas
of the form $\mathit{hold}(p, [T_1, T_2])$ to be read “p is valid on the
interval of time $[T_1, T_2]$”. For instance, the formula:

- $\mathit{hold\_at}(\mathit{believe}(HI, \mathit{hold\_at}(\mathit{Position}(sh2, pos1), 12{:}00)), 13{:}00)$

is to be read: “at time 13:00, the database associated with the integrity
level HI believes that the position of sh2 is pos1 at time 12:00”. In this
formula, time 12:00 is generally called a valid time (or a historic time)
and time 13:00 is called a transaction time (or a belief time) (Snodgrass
and Ahn, 1985).

Due to space limitations, we do not develop the axiomatics associated
with the modalities hold_at and hold, but see (Sripada, 1993) and
(Cuppens and Saurel, 1998) for a more detailed presentation. In the
following, we shall also use a special value now, with formulas of the
form $\mathit{hold\_at}(p, \mathit{now})$ to be read “p is valid until
changed” (see (Clifford et al., 1997) and (Cuppens and Saurel, 1998) for
a detailed presentation of the semantics and axiomatics associated with
now).

The multi-reliable database we consider contains temporal data of the
form $\mathit{hold}(\mathit{believe}(n, \mathit{hold}(p, I_1)), I_2)$,
where $n$ is an integrity level, $p$ is an atomic formula and $I_1$ and
$I_2$ are two time intervals. To represent modifications performed in a
multi-reliable database, we shall consider the following two
meta-predicates:

- insert. Formulas of the form $\mathit{insert}(a, p, I)$ are to be read
“agent a has inserted in the multi-reliable database the fact that p is
valid on the interval I”.

- delete. Formulas of the form $\mathit{delete}(a, p, I)$ are to be read
“agent a has deleted from the multi-reliable database the fact that p is
valid on the interval I”.

We could also define a third operation update. However, this last 
operation may be viewed as a combination of delete and insert. There- 
fore, it is not necessary to include this operation in our model. 
5. Reliability of an agent

Here we formalize the notion of reliability of agents who update the
multi-reliable database. As we want the reliability of an agent to depend
on the updated data, we introduce a meta-predicate safer_than, where
formulas of the form $\mathit{safer\_than}(a, p, n)$ are to be read: “the
integrity level of data p is higher than or equal to n when it is the
agent a who inserts p in the database”.

Actually, we consider that agents always insert, in the database,
data of the form $\mathit{hold}(p, [T_1, T_2])$. Therefore, we suggest the
following axiom to express the reliability of an agent when this agent
inserts some data in the multi-reliable database:

\[
\begin{aligned}
\forall a, \forall T_1, \forall T_2, \forall n,\; & \mathit{safer\_than}(a, \mathit{hold}(p, [T_1, T_2]), n) \leftrightarrow \\
& \mathit{believe}(n, \mathit{insert}(a, p, [T_1, T_2]) \rightarrow \mathit{hold}(p, [T_1, T_2]))
\end{aligned}
\]

i.e. $\mathit{safer\_than}(a, \mathit{hold}(p, [T_1, T_2]), n)$ is true iff
the database associated with the integrity level $n$ believes that if the
agent $a$ inserts $\mathit{hold}(p, [T_1, T_2])$ in the database, then $p$
is actually valid on the interval $[T_1, T_2]$. In other words,
$\mathit{safer\_than}(a, \mathit{hold}(p, [T_1, T_2]), n)$ is true iff the
database of level $n$ believes what agent $a$ tells about
$\mathit{hold}(p, [T_1, T_2])$.

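In the following, we shall also use the predicate insert' defined as
follows:

- $\mathit{insert}'(p, I, n)$ is to be read “data $\mathit{hold}(p, I)$ is
inserted in the database with an integrity level higher than or equal to
n”. We have:

\[
\begin{aligned}
\forall T_1, \forall T_2, \forall n,\; & \mathit{insert}'(p, [T_1, T_2], n) \leftrightarrow \\
& \exists a,\; \mathit{insert}(a, p, [T_1, T_2]) \wedge \mathit{safer\_than}(a, \mathit{hold}(p, [T_1, T_2]), n)
\end{aligned}
\]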
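
The relation between safer_than and insert' can be sketched as follows. This is an illustrative simplification, not the paper's formal semantics: reliability is stored in a hypothetical table keyed by agent and predicate name, levels are assumed totally ordered, and the reliability values for s and r are taken from the ship example. Because belief is downward closed, it is enough to record the highest level for which safer_than holds.

```python
from enum import IntEnum
from typing import Dict, List, Tuple


class Level(IntEnum):
    LI = 1
    MI = 2
    HI = 3


# (agent, predicate name) -> highest level n for which safer_than holds.
# Hypothetical table: satellite s is HI and radar r is LI on positions.
RELIABILITY: Dict[Tuple[str, str], Level] = {
    ("s", "Position"): Level.HI,
    ("r", "Position"): Level.LI,
}

# One logged insertion: (agent, predicate name, interval in minutes).
Insertion = Tuple[str, str, Tuple[int, int]]


def safer_than(agent: str, predicate: str, n: Level) -> bool:
    top = RELIABILITY.get((agent, predicate))
    return top is not None and n <= top


def insert_prime(log: List[Insertion], predicate: str,
                 interval: Tuple[int, int], n: Level) -> bool:
    """True iff some agent inserted the datum and is safer_than level n."""
    return any(
        pred == predicate and ival == interval and safer_than(agent, pred, n)
        for agent, pred, ival in log
    )
```

With only a radar insertion in the log, `insert_prime` holds at level LI but not at HI; adding a satellite insertion of the same datum makes it hold at HI as well.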



6. Evolution of a data integrity level in time 

To represent the fact that the integrity level of data may change in
time, we introduce a predicate revise_integrity, with formulas of the
form $\mathit{revise\_integrity}(p, I, n)$ to be read: “the integrity
level of data $\mathit{hold}(p, I)$ is revised to level n”.

From the predicates insert', delete and revise_integrity, we define three
other predicates insertion, deletion and revision_integrity as follows:
- Intuitively, $\mathit{insertion}(p, T, n)$ means “data
$\mathit{hold\_at}(p, T)$ is inserted in the database associated with the
integrity level n”. If $T$ is a date (not the special value now), then
there are two possible cases:

1. Data $\mathit{hold}(p, [T_1, T_2])$ is inserted in the database of
integrity level $n$ and time $T$ is between $T_1$ and $T_2$.

2. Data $\mathit{hold}(p, [T_1, \mathit{now}])$ is inserted in the
database of integrity level $n$, and $T$ is between $T_1$ and the date of
reference. Roughly speaking, the date of reference represents the
“present date”; it makes it possible to interpret the special value now
(see (Cuppens and Saurel, 1998)).

Therefore, writing $\preceq$ for the order on dates, we have:

\[
\begin{aligned}
\forall T \in \mathit{Date}, \forall n,\; & \mathit{insertion}(p, T, n) \leftrightarrow \\
& \exists T_1, \exists T_2,\; \big[ (\mathit{insert}'(p, [T_1, T_2], n) \wedge T_1 \preceq T \wedge T \preceq T_2) \\
& \quad \vee\; (\mathit{insert}'(p, [T_1, \mathit{now}], n) \wedge \mathit{Reference\_date}(T_2) \wedge T_1 \preceq T \wedge T \preceq T_2) \big]
\end{aligned}
\]

If $T$ is equal to now, we have:

\[
\forall n,\; \mathit{insertion}(p, \mathit{now}, n) \leftrightarrow \exists T,\; \mathit{insert}'(p, [T, \mathit{now}], n)
\]

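- $\mathit{deletion}(p, T)$ means “data $\mathit{hold\_at}(p, T)$ is
deleted from the multi-reliable database”. Following the definition of
insertion, if $T$ is a date, we have:

\[
\begin{aligned}
\forall T \in \mathit{Date},\; & \mathit{deletion}(p, T) \leftrightarrow \\
& \exists T_1, \exists T_2, \exists a,\; \big[ (\mathit{delete}(a, p, [T_1, T_2]) \wedge T_1 \preceq T \wedge T \preceq T_2) \\
& \quad \vee\; (\mathit{delete}(a, p, [T_1, \mathit{now}]) \wedge \mathit{Reference\_date}(T_2) \wedge T_1 \preceq T \wedge T \preceq T_2) \big]
\end{aligned}
\]

And if $T$ is equal to now:

\[
\mathit{deletion}(p, \mathit{now}) \leftrightarrow \exists a, \exists T,\; \mathit{delete}(a, p, [T, \mathit{now}])
\]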

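- $\mathit{revision\_integrity}(p, T, n)$ means “the integrity level of
data $\mathit{hold\_at}(p, T)$ is revised to level n”. Following the
definition of insertion, if $T$ is a date, we have:

\[
\begin{aligned}
\forall T \in \mathit{Date}, \forall n,\; & \mathit{revision\_integrity}(p, T, n) \leftrightarrow \\
& \exists T_1, \exists T_2,\; \big[ (\mathit{revise\_integrity}(p, [T_1, T_2], n) \wedge T_1 \preceq T \wedge T \preceq T_2) \\
& \quad \vee\; (\mathit{revise\_integrity}(p, [T_1, \mathit{now}], n) \wedge \mathit{Reference\_date}(T_2) \wedge T_1 \preceq T \wedge T \preceq T_2) \big]
\end{aligned}
\]

And if $T$ is equal to now:

\[
\forall n,\; \mathit{revision\_integrity}(p, \mathit{now}, n) \leftrightarrow \exists T,\; \mathit{revise\_integrity}(p, [T, \mathit{now}], n)
\]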
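
The three definitions above share one pattern: an interval ending in now is closed off at the reference date before testing whether it contains T. The sketch below illustrates this pattern for insertion only. It assumes dates are integers (minutes), levels are integers with larger meaning more reliable, and each record carries the highest level n for which insert' holds; all names are hypothetical.

```python
from typing import List, Optional, Tuple

NOW = None  # stands for the special value "now" (valid until changed)

Interval = Tuple[int, Optional[int]]   # (start, end), end may be NOW
Record = Tuple[Interval, int]          # (interval, highest level of insert')


def covers(interval: Interval, t: int, reference_date: int) -> bool:
    """Interval membership, reading an end of NOW as the reference date."""
    start, end = interval
    if end is NOW:
        end = reference_date
    return start <= t <= end


def insertion(records: List[Record], t: Optional[int], n: int,
              reference_date: int) -> bool:
    """insertion(p, T, n) for one fixed fact p, given its insert' records."""
    if t is NOW:
        # Case T = now: some open-ended interval was inserted at level >= n.
        return any(end is NOW and n <= level for (_, end), level in records)
    # Case T is a date: T lies in some inserted interval at level >= n.
    return any(covers(iv, t, reference_date) and n <= level
               for iv, level in records)
```

The same `covers` helper would serve for deletion and revision_integrity, with deletion or revision records in place of insertion records.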

Now, we introduce two additional predicates apparent_integrity and
real_integrity. These two predicates are defined as follows:

- $\mathit{apparent\_integrity}(p, T, n)$ means “the apparent integrity of
data $\mathit{hold\_at}(p, T)$ is higher than or equal to n”; that is, at
a previous date, the integrity of $\mathit{hold\_at}(p, T)$ was higher
than or equal to $n$, but it is possible that, since this date, data
$\mathit{hold\_at}(p, T)$ or its integrity level has been updated.

\[
\begin{aligned}
\forall n,\; & \mathit{apparent\_integrity}(p, T, n) \leftrightarrow \\
& \big( \mathit{insertion}(p, T, n) \;\vee\; \exists n',\; \mathit{revision\_integrity}(p, T, n') \wedge n \leq n' \big)
\end{aligned}
\]

- $\mathit{real\_integrity}(p, T, n)$ means “the integrity of data
$\mathit{hold\_at}(p, T)$ is higher than or equal to n”.

\[
\begin{aligned}
\forall T, \forall n,\; & \mathit{real\_integrity}(p, T, n) \leftrightarrow \exists T_1, \exists T_2, \\
& \big( \mathit{Reference\_date}(T_1) \;\wedge \\
& \quad \mathit{hold\_at}(\mathit{apparent\_integrity}(p, T, n), T_2) \wedge T_2 \preceq T_1 \;\wedge \\
& \quad \neg\exists T_3,\; (T_2 \prec T_3 \wedge T_3 \preceq T_1 \wedge \mathit{hold\_at}(\mathit{deletion}(p, T), T_3)) \;\wedge \\
& \quad \neg\exists T_3, \exists n',\; (T_2 \prec T_3 \wedge T_3 \preceq T_1 \\
& \qquad \wedge\; \mathit{hold\_at}(\mathit{revision\_integrity}(p, T, n'), T_3) \wedge n' \triangleleft n) \big)
\end{aligned}
\]

that is: the integrity of data $\mathit{hold\_at}(p, T)$ is higher than or
equal to $n$ if and only if:

1. Data $\mathit{hold\_at}(p, T)$ had an integrity level higher than or
equal to $n$ at a date $T_2$ which is before the reference date $T_1$.

2. Data $\mathit{hold\_at}(p, T)$ was not deleted between $T_2$ (excluded)
and $T_1$ (included).

3. The integrity level of data $\mathit{hold\_at}(p, T)$ was not updated
to an integrity level $n'$ strictly dominated by $n$ between $T_2$
(excluded) and $T_1$ (included).
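
The three conditions can be checked mechanically over a log of events concerning one fixed datum. The sketch below is a simplified reading, not the paper's axiomatization: it assumes integer transaction times, totally ordered levels, and a hypothetical event log in which each entry records an insertion, a revision or a deletion.

```python
from enum import IntEnum
from typing import List, Optional, Tuple


class Level(IntEnum):
    LI = 1
    MI = 2
    HI = 3


# One event about a fixed datum hold_at(p, T):
# (transaction time, kind, level); level is None for deletions.
Event = Tuple[int, str, Optional[Level]]


def real_integrity(events: List[Event], n: Level, reference_date: int) -> bool:
    for t2, kind, level in events:
        # Condition 1: apparent integrity >= n at some date T2 <= T1.
        if t2 > reference_date or kind == "delete":
            continue
        if level < n:
            continue
        later = [e for e in events if t2 < e[0] <= reference_date]
        # Condition 2: no deletion in (T2, T1].
        deleted = any(k == "delete" for _, k, _ in later)
        # Condition 3: no revision to a level strictly below n in (T2, T1].
        downgraded = any(k == "revise" and lv < n for _, k, lv in later)
        if not deleted and not downgraded:
            return True
    return False
```

For a datum inserted at HI at time 0, revised to MI at time 1 and to LI at time 3, the real integrity at reference date 2 is at least MI but no longer HI, and at reference date 5 it is only LI.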

We can now define in which situation the database associated with
the integrity level $n$ believes that data $\mathit{hold\_at}(p, T)$ is
valid. For this purpose, we suggest the following axioms:

\[
\begin{aligned}
\forall T_1, \forall T_2, \forall T, \forall n \in IL,\; & (\mathit{Reference\_date}(T_2) \wedge T_1 \preceq T_2) \rightarrow \\
& \mathit{hold\_at}(\mathit{believe}(n, \mathit{hold\_at}(p, T)) \leftrightarrow \mathit{real\_integrity}(p, T, n),\; T_1)
\end{aligned}
\]

that is: if $T_1$ is a date before the reference date $T_2$, then the
database at level $n$ believes at time $T_1$ that the fact $p$ (where
$p$ represents any atomic fact the multi-reliable database may contain)
is valid at time $T$ if and only if the real integrity level of data
$\mathit{hold\_at}(p, T)$ is higher than or equal to $n$.

\[
\begin{aligned}
\forall T_1, \forall T_2, \forall T, \forall n \in IL,\; & (\mathit{Reference\_date}(T_2) \wedge T_2 \prec T_1) \rightarrow \\
& \mathit{hold\_at}(\neg\mathit{believe}(n, \mathit{hold\_at}(p, T)),\; T_1)
\end{aligned}
\]

that is: if $T_1$ is a date strictly after the reference date $T_2$, then
the database at level $n$ does not believe, at time $T_1$, that $p$ is
valid at time $T$ (where $p$ represents any atomic fact the multi-reliable
database may contain). This axiom corresponds to the assumption
that a transaction time must always be before the reference date.



7. Integrity policy 

We now want to formalize an integrity policy which regulates updates 
in a multi-reliable database. This is done in three steps: 

1. Assignment of integrity levels to the users of the multi-reliable
database. This is done by using the predicate
$\mathit{safer\_than}(a, p, n)$ introduced in section 5.

2. Specification of rights and prohibitions for agents who can insert
or delete data in the database. For this purpose, we introduce two
deontic modalities P (permission) and I (prohibition).

For instance, the formula

$P\ \mathit{insert}(\mathit{Clementine}, \mathit{In\_vacation}(\mathit{Jane}), [1/11/97, 15/11/97])$

is to be read “Clementine is permitted to insert in the database the
fact: Jane is in vacation from November 1st to November 15th”.

3. Definition of the policy which regulates the evolution of integrity in
time. The objective is to give rules which specify how data integrity
has to be revised. For this purpose, we introduce a third modality
O, where formulas of the form
$O\ \mathit{revise\_integrity}(p, [T_1, T_2], n)$ are to be read “the
integrity level of $\mathit{hold}(p, [T_1, T_2])$ must be revised to
the level n”.



The axiomatics of the deontic modalities P, I and O is defined as follows:

- O (obligation) is associated with the axiomatics of SDL (Standard
Deontic Logic) (Meyer and Wieringa, 1991), which actually corresponds
to a KD logic for O:

K: $(Op \wedge O(p \rightarrow q)) \rightarrow Oq$

(if p is obligatory and if $p \rightarrow q$ is obligatory then q is
obligatory).

D: $Op \rightarrow \neg O \neg p$

(if p is obligatory then $\neg p$ is not obligatory).

N: If p is a theorem then Op is also a theorem.

(all theorems are obligatory).

- $Ip \stackrel{\mathrm{def}}{=} O\neg p$

(p is forbidden iff the negation of p is obligatory).

- $Pp \stackrel{\mathrm{def}}{=} \neg Ip$

(p is authorized iff p is not forbidden).

Using the formalism we have just defined, we can now specify an
integrity policy as follows:

1. Definition of the predicate safer_than,

2. Definition of the conditions under which
$P\ \mathit{insert}(a, p, [T_1, T_2])$ and
$P\ \mathit{delete}(a, p, [T_1, T_2])$ hold, and

3. Definition of the conditions under which
$O\ \mathit{revise\_integrity}(p, [T_1, T_2], n)$ holds.

Moreover we make the following closed-world assumption: if an agent
$a$ is not explicitly authorized to insert or delete
$\mathit{hold}(p, [T_1, T_2])$, then $a$ is forbidden to do so. This means
that an agent must be explicitly authorized to perform any insertion or
deletion in the database.
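
The closed-world reading of P and I can be sketched as a simple lookup: an operation is permitted only if a permission was explicitly granted, and forbidden otherwise. This is a deliberately reduced illustration; it ignores time intervals and the obligation modality O, and the class and method names are hypothetical.

```python
from typing import Set, Tuple

Permission = Tuple[str, str, str]  # (agent, operation, predicate name)


class UpdatePolicy:
    """Explicit permissions with a closed-world default (sketch)."""

    def __init__(self) -> None:
        self._granted: Set[Permission] = set()

    def permit(self, agent: str, operation: str, predicate: str) -> None:
        if operation not in ("insert", "delete"):
            raise ValueError("only insert and delete are regulated here")
        self._granted.add((agent, operation, predicate))

    def is_permitted(self, agent: str, operation: str, predicate: str) -> bool:
        return (agent, operation, predicate) in self._granted

    def is_forbidden(self, agent: str, operation: str, predicate: str) -> bool:
        # Pp is defined as "not Ip", so forbidden is exactly "not permitted".
        return not self.is_permitted(agent, operation, predicate)
```

An agent that was only granted `insert` on a predicate is therefore forbidden to `delete` it, without any explicit prohibition being stated.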

As mentioned before, we have also to deal with integrity constraints. 
Integrity constraints may be either state constraints (e.g., the position
of a ship is unique at a given time), or transition constraints (e.g., in
an hour, a ship cannot move more than 60 miles). In our approach, we 
assume that the state of the database reached when performing any 
insertion or deletion satisfies the set of integrity constraints: if it does 
not, the insertion or deletion is rejected. 

The next section shows how to specify examples of integrity policies 
formalized within our language. 



8. Examples of formalized integrity policies 
8.1. Example 1: Position of a ship

We use the following predicate symbols to formalize this first
application.

- Ship. $\mathit{Ship}(x)$ means “x is a ship”.

- Val_Pos. $\mathit{Val\_Pos}(y)$ means “y is the value of a position”.

- Position. $\mathit{Position}(x, y)$ means “the position of x is y”.

We then specify one typing constraint:

- $\forall x, \forall y,\; \mathit{Position}(x, y) \rightarrow \mathit{Ship}(x) \wedge \mathit{Val\_Pos}(y)$

and one integrity constraint, claiming that the position of a ship is
unique at any given time:

- $\forall x, \forall y, \forall y',\; \mathit{Position}(x, y) \wedge \mathit{Position}(x, y') \rightarrow y = y'$

Let us assume we have two information sources, a satellite s and a
radar r, which is supposed to be jammed.

The first step in our approach consists in specifying their level of
reliability as follows:

- $\forall x, \forall y, \forall T,\; \mathit{Ship}(x) \wedge \mathit{Val\_Pos}(y) \wedge \mathit{Date}(T)$
$\rightarrow \mathit{safer\_than}(s, \mathit{hold}(\mathit{Position}(x, y), [T, \mathit{now}]), HI)$

(the satellite is highly reliable when it tells about the position of
a ship).

- $\forall x, \forall y, \forall T,\; \mathit{Ship}(x) \wedge \mathit{Val\_Pos}(y) \wedge \mathit{Date}(T) \wedge \mathit{Jammed}(r)$
$\rightarrow \mathit{safer\_than}(r, \mathit{hold}(\mathit{Position}(x, y), [T, \mathit{now}]), LI)$

(if the radar is jammed, then its integrity level is low when it tells
about the position of a ship).

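The second step consists in defining the update policy:

- $\forall x, \forall y, \forall T,\; \mathit{Ship}(x) \wedge \mathit{Val\_Pos}(y) \wedge \mathit{Date}(T)$
$\rightarrow P\ \mathit{insert}(s, \mathit{Position}(x, y), [T, \mathit{now}])$

(the satellite is allowed to insert the position of any ship).

- $\forall x, \forall y, \forall T,\; \mathit{Ship}(x) \wedge \mathit{Val\_Pos}(y) \wedge \mathit{Date}(T)$
$\rightarrow P\ \mathit{insert}(r, \mathit{Position}(x, y), [T, \mathit{now}])$

(the radar is also allowed to insert the position of any ship).

- $\forall x, \forall y, \forall T,\; \mathit{Ship}(x) \wedge \mathit{Val\_Pos}(y) \wedge \mathit{Date}(T)$
$\rightarrow P\ \mathit{delete}(s, \mathit{Position}(x, y), [T, \mathit{now}])$

(the satellite is allowed to delete the position of any ship. So, by
combining both rights for insertion and deletion, the satellite is
allowed to update the position of any ship).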

- For the radar, the situation is more complex. Since the satellite
is more reliable than the radar, we do not want the radar to be
allowed to update any data which has just been given by the satellite.
However, we are interested in getting information from the radar when
information coming from the satellite becomes obsolete. So the policy
allows the radar to update (by two successive deletion and insertion
operations) a position given by the satellite only after a given time,
say after at least one minute. So we have:

\[
\begin{aligned}
\forall x, \forall y, \forall T,\; & \mathit{Ship}(x) \wedge \mathit{Val\_Pos}(y) \wedge \mathit{Date}(T) \;\wedge \\
& \neg\exists T_1, \exists T_2,\; (\mathit{hold\_at}(\mathit{insert}(s, \mathit{Position}(x, y), [T_1, \mathit{now}]), T_2) \wedge T \preceq T_1 + 1) \\
& \rightarrow P\ \mathit{delete}(r, \mathit{Position}(x, y), [T, \mathit{now}])
\end{aligned}
\]

with $T_1 + 1$ being an operation which adds one minute to time $T_1$
(Cuppens and Saurel, 1997).

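Then the third step consists in giving the integrity evolution policy.
We first define a predicate $\mathit{max\_hold}(p, [T_1, T_2])$ as follows
(see also Sripada (Sripada, 1993)):

\[
\begin{aligned}
\forall T_1, \forall T_2,\; & \mathit{max\_hold}(p, [T_1, T_2]) \leftrightarrow \mathit{hold}(p, [T_1, T_2]) \\
& \wedge\; \neg\exists T,\; (T \prec T_1 \wedge \mathit{hold}(p, [T, T_2])) \\
& \wedge\; \neg\exists T,\; (T_2 \prec T \wedge \mathit{hold}(p, [T_1, T]))
\end{aligned}
\]

that is: we have $\mathit{max\_hold}(p, [T_1, T_2])$ if and only if $p$ is
valid on the interval $[T_1, T_2]$ and there is no interval $I$ strictly
containing $[T_1, T_2]$ such that $p$ is valid on $I$.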

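The integrity evolution is then defined as follows:

\[
\begin{aligned}
\forall T, \forall T_1, \forall x, \forall y,\; & \mathit{Reference\_date}(T) \;\wedge \\
& \mathit{hold\_at}(\mathit{believe}(HI, \mathit{max\_hold}(\mathit{Position}(x, y), [T_1, \mathit{now}])), T) \;\wedge \\
& T_1 + 1 \preceq T \\
& \rightarrow O\ \mathit{revise\_integrity}(\mathit{Position}(x, y), [T_1 + 1, \mathit{now}], MI)
\end{aligned}
\]

that is: if the database believes at time $T$ that the position of the
ship has been highly reliable since time $T_1$, and if the reference
date is at least one minute after $T_1$, then the level of integrity
of the position of the ship has to be revised to the medium level
of integrity.

\[
\begin{aligned}
\forall T, \forall T_1, \forall x, \forall y,\; & \mathit{Reference\_date}(T) \;\wedge \\
& \mathit{hold\_at}(\mathit{believe}(MI, \mathit{max\_hold}(\mathit{Position}(x, y), [T_1, \mathit{now}])), T) \;\wedge \\
& T_1 + 2 \preceq T \\
& \rightarrow O\ \mathit{revise\_integrity}(\mathit{Position}(x, y), [T_1 + 2, \mathit{now}], LI)
\end{aligned}
\]

(similar to the previous one: the level of integrity of the position
has to be revised to a low level of integrity after two minutes).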

Let us see how this integrity policy applies to a practical example.
Let us assume that at 12:00, the satellite s provides the data: “the
position of the ship sh is pos1 at 12:00”.

- $\mathit{hold\_at}(\mathit{insert}(s, \mathit{Position}(sh, pos1), [12{:}00, \mathit{now}]), 12{:}00)$

Since s is allowed to insert this data, the database includes the
following data:

- $\mathit{hold}(\mathit{believe}(HI, \mathit{hold}(\mathit{Position}(sh, pos1), [12{:}00, \mathit{now}])), [12{:}00, \mathit{now}])$

According to the policy, the radar is not allowed to update this data
between 12:00 and 12:01. So, let us assume that at 12:01, r provides
data corresponding to: the position of sh is pos2 at time 12:01. As an
update corresponds to a deletion followed by an insertion, we have:

- $\mathit{hold\_at}(\mathit{delete}(r, \mathit{Position}(sh, pos2), [12{:}01, \mathit{now}]), 12{:}01)$

- $\mathit{hold\_at}(\mathit{insert}(r, \mathit{Position}(sh, pos2), [12{:}01, \mathit{now}]), 12{:}01)$

Since these operations are allowed for the radar r, the database then
contains the facts:

- $\mathit{hold}(\mathit{believe}(HI, \mathit{hold}(\mathit{Position}(sh, pos1), [12{:}00, \mathit{now}])), [12{:}00, 12{:}01[)$

- $\mathit{hold}(\mathit{believe}(HI, \mathit{hold}(\mathit{Position}(sh, pos1), [12{:}00, 12{:}01[)), [12{:}01, \mathit{now}])$

- $\mathit{hold}(\mathit{believe}(LI, \mathit{hold}(\mathit{Position}(sh, pos2), [12{:}01, \mathit{now}])), [12{:}01, \mathit{now}])$

and so on.

Now, let us assume that r does not provide any information between
12:00 and 12:05; then, according to the integrity evolution policy, the
state of the database at time 12:05 will be:

- $\mathit{hold}(\mathit{believe}(HI, \mathit{hold}(\mathit{Position}(sh, pos1), [12{:}00, \mathit{now}])), [12{:}00, 12{:}01[)$

- $\mathit{hold}(\mathit{believe}(HI, \mathit{hold}(\mathit{Position}(sh, pos1), [12{:}00, 12{:}01[)), [12{:}01, \mathit{now}])$

- $\mathit{hold}(\mathit{believe}(MI, \mathit{hold}(\mathit{Position}(sh, pos1), [12{:}01, \mathit{now}])), [12{:}01, 12{:}03[)$

- $\mathit{hold}(\mathit{believe}(MI, \mathit{hold}(\mathit{Position}(sh, pos1), [12{:}01, 12{:}03[)), [12{:}03, \mathit{now}])$

- $\mathit{hold}(\mathit{believe}(LI, \mathit{hold}(\mathit{Position}(sh, pos1), [12{:}03, \mathit{now}])), [12{:}03, \mathit{now}])$
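
The timeline of this example can be reproduced with a short sketch of the evolution policy, assuming no update by the radar. It hard-codes the two rules of the example (HI degrades to MI one minute after insertion, and MI degrades to LI two minutes later); the function names are hypothetical and times are counted in minutes since midnight.

```python
from enum import IntEnum


class Level(IntEnum):
    LI = 1
    MI = 2
    HI = 3


def minutes(hhmm: str) -> int:
    """Convert 'HH:MM' to minutes since midnight."""
    hours, mins = hhmm.split(":")
    return int(hours) * 60 + int(mins)


def level_at(inserted_at: int, t: int) -> Level:
    """Integrity level, at time t, of a position the satellite gave at inserted_at."""
    elapsed = t - inserted_at
    if elapsed < 0:
        raise ValueError("t is before the insertion time")
    if elapsed < 1:       # [T1, T1 + 1[ : still highly reliable
        return Level.HI
    if elapsed < 1 + 2:   # [T1 + 1, T1 + 3[ : revised to medium
        return Level.MI
    return Level.LI       # from T1 + 3 on: revised to low
```

With an insertion at 12:00, the level is HI at 12:00, MI at 12:01 and 12:02, and LI from 12:03 on, which matches the validity intervals of the five facts listed above.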



8.2. Example 2: In a company 
We introduce the following predicates: 
- $\mathit{Employee}(x)$: x is an employee.

- $\mathit{Secretary}(x)$: x is a secretary.

- $\mathit{Temporary}(x)$: x is a temporary employee.

- $\mathit{In\_vacation}(x)$: x is in vacation.

- $\mathit{Val\_Sal}(x)$: x is a salary value.

- $\mathit{Salary}(x, y)$: x's salary is y.

Here are the integrity constraints:

- $\forall x,\; \mathit{Secretary}(x) \rightarrow \mathit{Employee}(x)$

- $\forall x, \forall y,\; \mathit{Salary}(x, y) \rightarrow \mathit{Employee}(x) \wedge \mathit{Val\_Sal}(y)$

- $\forall x, \forall y, \forall y',\; \mathit{Salary}(x, y) \wedge \mathit{Salary}(x, y') \rightarrow y = y'$

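Now, we consider that we have three agents: two secretaries s1 and s2,
and a temporary employee s3. We first specify their respective
reliability:

- $\forall x, \forall y, \forall T,\; \mathit{Employee}(x) \wedge \mathit{Val\_Sal}(y) \wedge \mathit{Date}(T)$
$\rightarrow \mathit{safer\_than}(s1, \mathit{hold}(\mathit{Salary}(x, y), [T, \mathit{now}]), HI)$

(s1 is a highly reliable source for salaries)

- $\forall x, \forall T_1, \forall T_2,\; \mathit{Employee}(x) \wedge \mathit{Date}(T_1) \wedge \mathit{Date}(T_2)$
$\rightarrow \mathit{safer\_than}(s2, \mathit{hold}(\mathit{In\_vacation}(x), [T_1, T_2]), HI)$

(s2 is highly reliable for vacations. $T_2$ must be a date, not the
special value now)

- $\forall x, \forall y, \forall T,\; \mathit{Employee}(x) \wedge \mathit{Val\_Sal}(y) \wedge \mathit{Date}(T)$
$\rightarrow \mathit{safer\_than}(s3, \mathit{hold}(\mathit{Salary}(x, y), [T, \mathit{now}]), MI)$

(s3 is a medium reliable source for salaries)

- $\forall x, \forall T_1, \forall T_2,\; \mathit{Employee}(x) \wedge \mathit{Date}(T_1) \wedge \mathit{Date}(T_2)$
$\rightarrow \mathit{safer\_than}(s3, \mathit{hold}(\mathit{In\_vacation}(x), [T_1, T_2]), HI)$

(s3 is highly reliable for vacations)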



The second step consists in defining the updating policy as follows:

— ∀x, ∀y, ∀T, Employee(x) ∧ Val_Sal(y) ∧ Date(T)

  → P insert(s1, Salary(x, y), [T, now])

— ∀x, ∀y, ∀T, Employee(x) ∧ Val_Sal(y) ∧ Date(T)

  → P delete(s1, Salary(x, y), [T, now])

  (s1 is allowed to update (insert then delete) salaries of employees)

— ∀x, ∀T1, ∀T2, Employee(x) ∧ Date(T1) ∧ Date(T2)

  → P insert(s2, In_vacation(x), [T1, T2])

— ∀x, ∀T1, ∀T2, Employee(x) ∧ Date(T1) ∧ Date(T2)

  → P delete(s2, In_vacation(x), [T1, T2])

  (s2 is allowed to update data corresponding to the vacation of
  employees)

— ∀x, ∀y, ∀T, Employee(x) ∧ Val_Sal(y) ∧ Date(T) ∧ In_vacation(s1)

  → P insert(s3, Salary(x, y), [T, now])

— ∀x, ∀y, ∀T, Employee(x) ∧ Val_Sal(y) ∧ Date(T) ∧ In_vacation(s1)

  → P delete(s3, Salary(x, y), [T, now])

  (during s1's vacation, s3 is allowed to update salaries of employees)

— ∀x, ∀T1, ∀T2, Employee(x) ∧ Date(T1) ∧ Date(T2) ∧ In_vacation(s2)

  → P insert(s3, In_vacation(x), [T1, T2])

— ∀x, ∀T1, ∀T2, Employee(x) ∧ Date(T1) ∧ Date(T2) ∧ In_vacation(s2)

  → P delete(s3, In_vacation(x), [T1, T2])

  (if s2 is on vacation, then s3 is allowed to update data corresponding
  to the vacation of employees)
Finally, we assume that there is no rule to specify how data integrity
changes in time. Therefore, the integrity policy is now completely 
defined. 
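The updating policy of this example can be approximated by a simple permission check. The sketch below is a deliberate simplification: it ignores the time intervals and the Employee, Val_Sal and Date guards, and the function name may_update and the on_vacation argument are hypothetical.

```python
def may_update(agent, predicate, on_vacation):
    """Return True if `agent` may insert or delete facts of `predicate`.

    `on_vacation` is the set of agents currently on vacation (an assumed
    stand-in for the In_vacation facts held in the database).
    """
    if predicate == "Salary":
        # s1 always; s3 only while s1 is on vacation
        return agent == "s1" or (agent == "s3" and "s1" in on_vacation)
    if predicate == "In_vacation":
        # s2 always; s3 only while s2 is on vacation
        return agent == "s2" or (agent == "s3" and "s2" in on_vacation)
    return False
```

For instance, may_update("s3", "Salary", {"s1"}) holds, whereas the same request is refused when s1 is not on vacation.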



9. Conclusion 

In this paper, we have presented a formal model to specify integrity 
policies for database management systems. A major difference with 
previous proposals is that this model includes an explicit representation 
of time. This allows us to: 

1. Manage historical data. It is possible to update highly reliable
data with less reliable data; the highly reliable data is not deleted from
the database but recorded in the history. This represents a major
difference with Biba and corresponds to real practical requirements.

2. Label data with transaction time. This makes it possible to revise 
the integrity level of “old” data. 

3. Model integrity policies where rights may change in time (as in 
(Bertino et al., 1996)). 

We can also make the following comments when comparing
our model with Clark-Wilson (Clark and Wilson, 1987). Clark-Wilson's
model is based on two basic concepts:

— Separation of duty. This corresponds to the requirement that several
different agents may be necessary to perform a given task. This
requirement is not explicitly included in our model, but we believe
that, using the concepts of permission and prohibition, it would be
easy to specify this kind of notion in our model. See also
(Sandhu, 1988) for some practical ideas to deal with separation of
duty in database management systems.

— Well-formed transaction. This corresponds to the requirement that
any transaction should transform a valid system state into another
valid system state. As mentioned before, this essentially corresponds
to the notion of integrity constraints, which is implemented
in most database management systems.

In Clark-Wilson, rights are represented by triples (agent, procedure,
set of data items) which specify that a given agent is allowed to access
a given set of data by using a given procedure. Our model refines
Clark-Wilson's triples by including the possibility to specify rights that may
change in time.

In (Sandhu and Jajodia, 1991), Sandhu and Jajodia also investigate
basic integrity principles and suggest mechanisms to implement
them in the case of a database management system. Their conclusion
is that classical database management systems already include several
mechanisms (in particular integrity constraint management) which may
be directly used to implement these integrity principles. They also
mention that delegation of authority is an important requirement
when specifying integrity policies. This requirement is not represented
in our model (nor is it in Clark-Wilson's). This is a further
refinement that remains to be done.

Finally, let us mention that our approach has been used to model
the integrity policy of a police application which manages a database
of criminal cases. In particular, the fact that the integrity level of
data may change in time corresponds to a practical requirement in
this application. It is also necessary to accept clues of low reliability
even though there are already more reliable clues in the database; and of
course, there are various classes of agents and their rights generally
change in time. Therefore, this application fits well with the various
concepts which are included in our integrity policy model.

Abstract

Many real-time database systems are contained in environments that exhibit 
restricted access to information, such as government agencies, hospitals and 
military institutions, where mandatory access control for security is required. In 
addition to such security constraints, real-time database systems have real-time 
integrity constraints. These real-time constraints require deadlines to be met and 
data to be temporally consistent. Conventional multi-level secure database models 
are inadequate for time-critical applications and conventional real-time database 
models do not support security constraints. We propose a new concurrency control 
algorithm for secure real-time databases. We implement the algorithm and study 
the performance using a real-time database system simulation model. Results show 
that the algorithm performs fairly well in terms of security and timeliness 
compared to the non-secure algorithm. We argue and show that achieving more 
security does not necessarily mean a great deal of sacrifice in maintaining real-time 
constraints. 

1 INTRODUCTION

Real-time database systems (Ozsoyglu 1995, Ramamritham 1993, Wolfe 1997) 
have real-time integrity constraints in addition to the integrity constraints found in 
conventional databases. Specifically, real-time database systems have timing 
constraints and temporal consistency constraints. Some examples of real-time 
databases are avionics, radar tracking, managing automated factories, robot 
navigation, program stock trading, and military command and control. The timing 
constraints are typically in the form of deadlines which require a transaction to be 
completed by a specified time. Failure to meet such a deadline causes the results 
to lose their value, and in some cases a result produced too late may have a 
negative value. The temporal consistency constraints require data to be up to date
and related data items to be close to one another in time. Much of the data in a real-time database is
only valid during a specified interval. Failure to meet these temporal consistency 
constraints compromises the integrity of the real-time database. 

Many real-time database systems are contained in environments that exhibit 
hierarchical propagation of information. Such real-time database systems may 
require restricted access by users to the data. Mandatory access control is used to 
ensure the security of data in hierarchical environments, and is typically 
implemented by multilevel secure (MLS) databases (Bell 1974, Denning 1988, 
Jajodia 1991). However, major efforts to design secure MLS databases have not 
considered databases with the real-time constraints of deadlines and temporal 
consistency. A secure real-time system has to simultaneously satisfy two goals: 
ensure that the real-time constraints are satisfied and provide security. These two
goals can conflict with each other, and achieving one may mean sacrificing the other.
The objective of our work is to study the factors involved in security control of 
real-time databases, develop suitable concurrency control algorithms, and using a 
real-time database simulation, study the effect on real-time integrity of maintaining 
security in real-time databases. 

All MLS models (Ramamritham 1993, Jajodia 1991) are based on the 
classification of the system elements, where classifications are expressed by 
security levels. Data objects have security levels and users have clearance levels. 
A user can read a certain object only if the subject’s clearance level dominates the 
object’s security level. According to the Bell-LaPadula properties (George 1997) 
for MLS databases, a subject cannot read an object of a higher or incomparable 
security level than the subject and all writes must take place at the subject’s 
security level or higher. However, the concurrent execution of transactions results 
in contention for data objects. As a result, it is possible to have an indirect flow of 
information from objects at higher levels to subjects at lower levels due to a covert 
channel (Moskowitz 1994, Qian 1994). For example, if the results from a lower 
security level transaction are delayed when there is a higher level security 
transaction, then the lower security level user can determine there are transactions 
at higher levels, and may even be able to receive information from the length of the
delay. 

Enforcing database security can compromise the real-time integrity by causing 
deadlines to be missed and data to become temporally inconsistent. For example, 
suppose there is a transaction with an earlier deadline at a high security level and a 
transaction with a later deadline at a low security level, and there is a data conflict 
between them. If the low security level transaction gets the data and blocks the 
high security transaction, then although security is maintained, the real-time 
constraints may be violated. The high security transaction has an earlier deadline, 
and due to its blocking may miss its deadline. If the reverse is allowed to happen 
to maintain the real-time constraints, then security is violated as a covert channel is 
introduced. 

Whether to maintain real-time constraints or security is dependent upon the 
system. If the system requires that security be maintained regardless of the real- 
time constraints, then conflict must be resolved in favor of security. On the other 
hand, if the system requires the real-time constraints be maintained, then security 
must be sacrificed in favor of the real-time constraints. If the system allows a 
compromise between security and priority, then the goal is to maintain as much 
security as possible without violating the real-time constraints significantly. In this 
paper we present a new concurrency control algorithm based on 2-phase locking 
(2PL). The algorithm recognizes the constraints of real-time transactions as well 
as security. The algorithm can be used for systems where security can be 
compromised for real-time constraints and vice versa, and also for systems where 
security is a correctness criterion and must be maintained.

The rest of the paper is organized as follows. In section 2, we discuss related 
work in secure real-time concurrency control and the secure real-time factor. In 
section 3, we present the secure concurrency control algorithm and metrics for 
security maintenance, and in section 4 we describe the simulation model and the 
results. In section 5 we present our conclusions. 

2 SECURE REAL-TIME CONCURRENCY CONTROL 

The few works which address the security of real-time databases are described in 
(George 1997, David 1995, Son 1995). In (David 1995) a concurrency control 
strategy is presented which trades off security for improved timeliness if the 
system does not provide the desired deadline miss percentage. They use a measure 
called “capacity” to adjust the covert channel to get better real-time performance. 
In (Son 1995) a secure two-phase locking algorithm is used to allow partial 
violations of security for improved timeliness. Decisions are made concerning the 
trade off between security and meeting deadlines by comparing two measures to 
resolve conflicts: the security factor and the deadline miss factor. This comparison 
is used to determine if a lower level transaction should be aborted or proceed when 
it conflicts with a higher level transaction. If the security properties are more 
important, then any conflict is resolved in favor of the lower level transaction, 
otherwise if meeting deadlines is more important, then the higher priority 
transaction is given precedence. Our work also considers the need to trade off
between security and timeliness but shows that security can be achieved with 
negligible sacrifice in maintaining real-time constraints. 

In (George 1997), security of firm real-time databases is addressed. In this study, 
security is viewed as a correctness criterion, and the number of missed deadlines as 
a performance issue. They do not trade off between missed deadlines and security, 
but instead propose to minimize the number of missed deadlines without 
compromising security through the choice of concurrency control strategy. They 
examine the performance of such strategies as a two-phase locking priority 
scheme, a prioritized optimistic concurrency control algorithm and a new 
approach, called the dual approach. The dual approach utilizes different 
concurrency control strategies depending on whether transactions are at the same 
security level or at different security levels. In contrast, our algorithm is capable of 
handling transactions both at the same and different security levels. 

2.1 Secure real-time factor 

The concurrency control algorithm for a secure real-time database system must use 
security levels of transactions as well as their deadlines to resolve data conflicts. 
In addition, the difference in security levels of transactions needs to be
considered when a security conflict is to be resolved. If one transaction is at the
highest security level and the other is at the lowest and a covert channel is
introduced between them, then the severity of the covert channel will be higher
than when the transactions are at adjacent security levels. None of the
previous works discussed in the previous section recognizes the difference in 
security levels as a measure to resolve a security conflict. In secure real-time 
database systems, where satisfying real-time constraints is one of the goals to be 
achieved, one cannot afford to sacrifice it for some covert channel not severe 
enough to be concerned about. Thus, we believe that difference in security level is 
an important issue for determining whether a covert channel has to be closed by 
possibly sacrificing real-time constraints. To be able to determine the severity or 
consequences of security violations, we introduce the following “covert channel 
property” for secure real time database systems. 

Covert Channel Property: The greater the difference of security levels between two
transactions in a data conflict, the greater the severity or the consequence if a covert
channel is introduced.

The covert channel property indicates that consequence is proportional to the 
difference in security or access levels. In other words, the greater the difference in 
access levels, the more important it is to maintain security and close the covert 
channel. If two conflicting transactions are at two extreme security levels, the 
consequence of opening a covert channel is the maximum. If the two conflicting 
transactions are at two adjacent security levels the consequence is the minimum. 
Of course, if the two transactions are at the same security level, there is no covert 
channel, and hence no consequence. We introduce a metric to measure the
consequence of introducing a covert channel, to be known as the Covert Channel 
Factor (CCF). The CCF is obtained by normalizing the difference of access levels 
between the two conflicting transactions: 

$$\mathrm{CCF} = \frac{\text{Difference in access levels}}{\text{Maximum difference possible}} = \frac{\text{Difference in access levels}}{\text{\# of access levels} - 1}$$

The maximum value of CCF is 1 when the two transactions are at two terminal
access levels. The minimum value of CCF is 1/(# of access levels - 1) when the
two transactions are at two consecutive security levels. Of course, if two
transactions are at the same security level, CCF is zero, meaning there is no
covert channel.
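A minimal sketch of the CCF computation follows, assuming access levels are encoded as integers 0, 1, ..., n - 1 (the encoding and the function name ccf are choices made for illustration only).

```python
def ccf(level_a, level_b, num_levels):
    """Covert Channel Factor: normalized difference between two access levels."""
    if num_levels < 2:
        raise ValueError("at least two access levels are needed")
    return abs(level_a - level_b) / (num_levels - 1)
```

With six access levels, ccf(0, 5, 6) gives the maximum value 1, two consecutive levels give 1/5, and two equal levels give 0.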

2.2 Security tolerance 

The CCF gives a measure of security violation. The greater the CCF, the greater 
the difference in access levels in a security conflict and hence, according to the 
covert channel property, the greater the severity of security violation. Depending 
upon the system requirements, security violation may or may not be tolerated. The 
security tolerance is defined as the maximum security violation a system permits. 
Since a security violation is measured in terms of CCF, so is the security tolerance. 
A security tolerance of 0 means the system does not allow any covert channel and 
a security tolerance of 1 means the system allows all possible covert channels. In 
other words, the security tolerance is the value of a CCF that corresponds to the 
upper limit of security violation in a system. For example, assume a system only 
permits the covert channels between two consecutive access levels. In this case, 
any covert channel having the difference in access levels greater than 1 is not 
allowed. The security tolerance in this system will be the value of the CCF 
corresponding to the covert channel with one access level difference, i.e. 



$$\text{Security tolerance} = \frac{\text{Difference in access levels}}{\text{Maximum difference possible}} = \frac{1}{\text{\# of access levels} - 1}$$

The smaller the value of tolerance, the more important security is, and vice
versa. In the next section we will see how we can use security tolerance to
represent the importance of security. In a security conflict, if the CCF is greater
than the tolerance, then the conflict is resolved in favor of security; otherwise the
conflict is resolved in favor of priority based on the real-time constraints.
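The decision rule can be illustrated with a short sketch. The function name favors_security is hypothetical, and exact fractions are used (a design choice of this sketch, not of the paper) so that a CCF equal to the tolerance is not misclassified by floating-point rounding.

```python
from fractions import Fraction

def favors_security(level_difference, num_levels, tolerance):
    """True if a conflict with this access-level difference is resolved in favor of security."""
    ccf = Fraction(level_difference, num_levels - 1)
    return ccf > tolerance

NUM_LEVELS = 6
# Tolerance that permits covert channels between consecutive levels only.
tolerance = Fraction(1, NUM_LEVELS - 1)
decisions = {d: favors_security(d, NUM_LEVELS, tolerance) for d in range(1, NUM_LEVELS)}
```

Here decisions maps a level difference of 1 to False (the channel is tolerated) and every larger difference to True (the channel is closed).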

3 SECURE CONCURRENCY CONTROL ALGORITHMS

As mentioned earlier, secure real-time database systems have to satisfy the security
constraints in addition to the real-time constraints. Security can be thought of as a
correctness criterion, in which case it must be enforced. It can also be thought of as a
criterion to be traded off against the real-time constraints, in which case security can be
sacrificed to satisfy more real-time constraints. The algorithm we present here supports
both types of security. Before describing our algorithm, we give a brief
introduction to the existing non-secure algorithm upon which our algorithm is
based. This algorithm is 2PLHP (Abbot 1992), and our algorithm is
called the Secure 2PLHP algorithm. They are described in the following
subsections.

3.1 2PLHP 

The 2PL High Priority (2PLHP) algorithm is a modification of the strict two-phase 
locking algorithm (2PL) (Abbot 1992) and incorporates the priority of transactions. 
The priority of a transaction is based on its real-time constraints. The earlier its 
deadline, the higher its priority. When a transaction requests a lock on a data item 
that is held by one or more higher priority transactions, the requesting transaction 
waits for the data item to be released. If the data item is locked by only lower 
priority transactions, the lower priority transactions are aborted and restarted with 
the same deadline, and the lock is granted to the requesting transaction. If priority 
is unique, 2PLHP is deadlock free. 
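The 2PLHP conflict rule can be sketched as follows. The function name and return labels are hypothetical, and treating a tie in deadlines as a reason to block is an assumption, since the description presumes unique priorities.

```python
def resolve_2plhp(requester_deadline, holder_deadlines):
    """Decide what happens when a transaction requests a locked data item.

    An earlier deadline means a higher priority. Ties are treated as
    blocking (an assumption; the text presumes unique priorities).
    """
    if not holder_deadlines:
        return "grant"                      # nobody holds the lock
    if any(d <= requester_deadline for d in holder_deadlines):
        return "block"                      # some holder has higher priority
    return "abort_holders"                  # all holders have lower priority
```

In the last case the aborted holders would be restarted with their original deadlines and the lock granted to the requester.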

3.2 Secure 2PLHP 

The 2PLHP algorithm does not recognize security. To incorporate security we 
examined all the scenarios involving deadline and access-levels between lock- 
holding and lock-requesting transactions. For secure real-time database systems 
five types of conflict can occur. We now describe the strategy taken by our 
algorithm for each conflict. Assume T1 is the lock-requesting and T2 is the lock- 
holding transaction. 

1. Deadline(T1) > Deadline(T2) and Access level(T1) > Access level(T2): In this
case the requesting transaction is at a lower priority and a higher security
level. We can abort or block the requester. There will not be any covert
channel or priority violation.

    Block T1 // priority and security maintained

2. Deadline(T1) > Deadline(T2) and Access level(T1) < Access level(T2): In this
case the requester is at a lower security level and a covert channel will be
introduced if it is blocked or aborted. However, if the requester is allowed to
proceed and the lock-holder aborted, priority will be violated. In this case, we
compute the CCF of the covert channel that would be introduced if T1 were
aborted or blocked. If the CCF is greater than the
tolerance then the lock-holder (T2) is aborted, otherwise the lock-requester
(T1) is blocked.

    If CCF > tolerance then
        Abort T2 // security maintained
    Else
        Block T1 // priority maintained

3. Deadline(T1) < Deadline(T2) and Access level(T1) > Access level(T2): In this
case, the requester is at a higher priority and at a higher access level, and a
covert channel will be opened if the lock-holder is aborted. Here we need to
compute the CCF of the covert channel that would be introduced if T2 were
aborted. If the CCF is greater than the tolerance,
then T2 is allowed to proceed and T1 is aborted, otherwise T1 is granted the lock
and T2 is aborted.

    If CCF > tolerance then
        Abort T1 // security is maintained
    Else
        Abort T2 // priority is maintained

4. Deadline(T1) < Deadline(T2) and Access level(T1) < Access level(T2): In this
case the requester is at a lower security level and a higher priority, and we can
resolve the conflict by aborting T2. Priority is maintained and no covert
channel is introduced.

    Abort T2 // priority and security maintained

5. Access level(T1) = Access level(T2): In this case, the two transactions are at the
same security level and therefore there is no covert channel. The conflict is
resolved according to the 2PLHP algorithm.

    If Deadline(T1) < Deadline(T2) then
        Abort T2
    Else
        Block T1
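The five cases can be collected into a single decision function. This is a sketch under stated assumptions rather than the authors' implementation: access levels are integers with larger values meaning higher levels, the function name and return strings are hypothetical, and equal deadlines (not covered by cases 1 to 4) are treated as the requester having the lower priority.

```python
from fractions import Fraction

def resolve_secure_2plhp(t1_deadline, t1_level, t2_deadline, t2_level,
                         num_levels, tolerance):
    """T1 requests a lock held by T2; return the action taken."""
    t1_higher_priority = t1_deadline < t2_deadline  # ties: T1 is not higher (assumption)
    if t1_level == t2_level:                        # case 5: plain 2PLHP
        return "abort T2" if t1_higher_priority else "block T1"
    ccf = Fraction(abs(t1_level - t2_level), num_levels - 1)
    if not t1_higher_priority and t1_level > t2_level:   # case 1
        return "block T1"
    if not t1_higher_priority and t1_level < t2_level:   # case 2
        return "abort T2" if ccf > tolerance else "block T1"
    if t1_higher_priority and t1_level > t2_level:       # case 3
        return "abort T1" if ccf > tolerance else "abort T2"
    return "abort T2"                                    # case 4
```

With a tolerance of 0 the function always sides with the transaction at the lower access level; with a tolerance of 1 it never closes a covert channel and cases 2 and 3 are decided by priority alone.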

3.3 Choice of tolerance values 

The choice of a tolerance value is very important, as it provides a way to control
the balance between priority and security maintained in the system. The smaller the value of tolerance,
the more important it is to maintain security and the greater the number of times a
conflict is resolved in favor of security. By choosing an appropriate value for
tolerance, the system can be kept 100% free of covert channels, with every data
conflict resolved in favor of the transaction at the lower security level. In order for
that to happen, the condition in the if-then in cases 2 and 3 in section 3.2 must be
true for every value of CCF. In other words, the minimum CCF should be larger
than the tolerance value, i.e., the tolerance should be smaller than the minimum
CCF. A tolerance equal to or greater than the minimum CCF will allow some violation of security.
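The condition can be checked by brute force over all pairs of distinct access levels. The sketch assumes integer-encoded levels, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import combinations

def is_covert_channel_free(num_levels, tolerance):
    """True if every pair of distinct access levels has CCF > tolerance."""
    return all(
        Fraction(high - low, num_levels - 1) > tolerance
        for low, high in combinations(range(num_levels), 2)
    )
```

For six access levels, any tolerance below 1/5 makes the check succeed, while a tolerance of exactly 1/5 already lets channels between consecutive levels through.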

3.4 Metrics of security maintenance

We now introduce two metrics or security factors to measure the security 
maintenance of a system. One metric keeps track of the number of times security 
has been maintained. The other one recognizes the differences between the access 
levels. 

$$\text{Security Factor 1} = \frac{\text{\# of times security is maintained}}{\text{Total number of security conflicts}}$$

$$\text{Security Factor 2} = \frac{\text{Sum of the difference in access levels for conflicts having security maintained}}{\text{Sum of the difference in access levels in all security conflicts}}$$

Both metrics are suitable for measuring the performance of the system.
Depending upon the system, one metric might be more appropriate than the other.
If only the number of conflicts in which security is maintained is of concern, the first factor is
appropriate. The second factor is appropriate in systems where the difference in
access levels is crucial. In this study, we choose security factor 2.
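Both factors can be computed from a log of security conflicts. The log format used below, a list of (level difference, security maintained) pairs, and the function name are assumptions made for illustration.

```python
def security_factors(conflicts):
    """Return (factor 1, factor 2) from (level_difference, security_maintained) pairs."""
    if not conflicts:
        raise ValueError("no security conflicts recorded")
    # Level differences of the conflicts in which security was maintained.
    kept = [diff for diff, maintained in conflicts if maintained]
    factor1 = len(kept) / len(conflicts)
    factor2 = sum(kept) / sum(diff for diff, _ in conflicts)
    return factor1, factor2
```

For the log [(5, True), (1, False), (2, True), (1, False)], factor 1 is 2/4 while factor 2 is 7/9, which shows how the second metric gives more weight to conflicts across widely separated levels.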

3.5 Metric of real-time constraint maintenance 

When deadlines are missed, the temporal data is not updated in time and data 
becomes temporally inconsistent. For this study we will use the percentage of 
deadlines missed as one of the measures of the maintenance of the real-time 
constraints. 

We also use the priority maintenance factor as a second measure. In order to 
express the level of priority maintenance in a system, we use the following priority 
maintenance factor. 

$$\text{Priority maintenance factor} = \frac{\text{\# of times priority is maintained}}{\text{Total number of data conflicts}}$$

This metric is used to determine how priority maintenance affects the real-time 
performance. 
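The two real-time measures reduce to simple ratios; a sketch with hypothetical function and argument names:

```python
def priority_maintenance_factor(priority_maintained, data_conflicts):
    """Fraction of data conflicts that were resolved in favor of priority."""
    if data_conflicts == 0:
        raise ValueError("no data conflicts recorded")
    return priority_maintained / data_conflicts

def deadline_miss_percentage(missed, submitted):
    """Percentage of submitted transactions that missed their deadline."""
    if submitted == 0:
        raise ValueError("no transactions submitted")
    return 100.0 * missed / submitted
```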

4 SIMULATION MODEL 

This section outlines the structure and details of our simulation model used to 
evaluate the performance of our concurrency control algorithms for real-time 
database systems. Central to our simulation model is a single-site main memory 
database system operating on a single processor. The database is modeled as a 
collection of data pages in memory. The simulation consists of three main 
components: a Transaction Manager (TM), a CPU Manager (CM), and a Log 
Manager (LM). The TM is responsible for issuing lock requests, CM for granting 
CPU access, and LM for log disk access. The service discipline used for the queues 
is Earliest Deadline First (EDF) (Liu 1979) without preemption. Each transaction 
consists of multiple operations each of which can be either read or write. If the
operation is read, then the accessed page is not updated. The write operation 
updates the accessed page and an entry is written into the log buffer. 
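An Earliest Deadline First queue without preemption can be sketched with a binary heap. The class below is an illustration in Python, not the simulator's C++ code; the class name and the tie-breaking by insertion order are assumptions.

```python
import heapq
import itertools

class EDFQueue:
    """Serve waiting transactions in order of earliest deadline."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker: insertion order

    def push(self, deadline, transaction):
        heapq.heappush(self._heap, (deadline, next(self._order), transaction))

    def pop(self):
        # The popped transaction is served to completion (no preemption).
        _, _, transaction = heapq.heappop(self._heap)
        return transaction

    def __len__(self):
        return len(self._heap)
```

Because a transaction is only selected when the resource becomes free, a later arrival with an earlier deadline waits for the current service to finish rather than preempting it.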

When an operation of a transaction makes a data access request, i.e., a lock request
on a data object, to the TM, the request goes through concurrency control to obtain
a lock on the data object. If the request is granted, the transaction requests CPU
access from the CM. If the CPU is free, the request is granted and the transaction performs
its CPU computation. After the CPU computation, if any
operations are left, the transaction proceeds with the next operation and makes a lock
request to the TM. If all operations are done, the transaction requests log disk
access from the LM and, if access is granted, writes the log buffer to the log disk and
commits.

If the request for the lock is denied, the transaction will be placed into a block 
queue. The blocked transaction will be awakened when the requested lock is 
released. 

4.1 Parameters of simulation model 

Table 1 gives the system resource parameters. The parameter CPU_TIME is the 
time to process a page by a CPU. The simulation does not explicitly take into 
account the time required for accessing the transaction manager, the CPU manager, 
and the log manager. It is assumed that those times are included in the time 
required to access the resources, i.e., the CPU, and the log disk. 

Table 1 System resource parameters

Parameter     Explanation                              Value
DBSIZE        Number of data pages in the database     400
CPU_TIME      CPU time for processing a data page      5 msec
WRITE_PROB    Probability that an operation is write   0.5
MAXACCESS     Number of security access levels         6

Table 2 summarizes the workload parameters that characterize the transactions and
the system workload. Transaction inter-arrival times are exponentially distributed.
The Rate parameter specifies the mean rate of transaction arrivals. TransSize
is the mean number of operations in a transaction; the number of operations is
drawn from a normal distribution with mean TransSize. The actual data objects or pages
accessed by each operation are uniformly distributed across the whole database.
LogDelay is the time required for writing a log buffer to the log disk.
RestartDelay is the overhead for rollback when a transaction is aborted. The
parameters MinSlack and MaxSlack set the lower and upper bounds of the slack
factor used in transactions' deadlines. The following deadline assignment formula (Abbot 1992)
is used to assign a deadline to each arriving transaction.

Deadline = Arrival time + Uniform(MinSlack, MaxSlack) * Execution time

The Arrival time is the time of arrival of each transaction. The Execution time of a
transaction is calculated from the data requirements of all its operations using
TransSize, CPU_TIME, and LogDelay.
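The deadline assignment can be sketched as follows. The constants come from Tables 1 and 2, but the exact execution-time estimate is not spelled out in the text; the sketch assumes one CPU_TIME per operation plus one log write, and all names are hypothetical.

```python
import random

CPU_TIME_MS = 5.0                 # Table 1
LOG_DELAY_MS = 1 * CPU_TIME_MS    # Table 2: 1 unit of CPU_TIME
MIN_SLACK, MAX_SLACK = 2, 8       # Table 2

def execution_time(num_operations):
    # Assumption: each operation costs one CPU_TIME, plus one log write at commit.
    return num_operations * CPU_TIME_MS + LOG_DELAY_MS

def assign_deadline(arrival_time, num_operations, rng):
    """Deadline = arrival time + Uniform(MinSlack, MaxSlack) * execution time."""
    slack = rng.uniform(MIN_SLACK, MAX_SLACK)
    return arrival_time + slack * execution_time(num_operations)
```

Passing an explicit random.Random instance as rng keeps the workload reproducible across runs of different algorithms.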

Table 2 Workload parameters

Parameter      Meaning                            Value
Rate           Arrival rate of the transactions   [5, 50]
TransSize      Average transaction size           6
LogDelay       Overhead for log disk access       1 unit of CPU_TIME
RestartDelay   Overhead for restarting            1 unit of CPU_TIME
MinSlack       Minimum slack factor               2
MaxSlack       Maximum slack factor               8

4.2 Experimental setup 

The simulation program is written in C++ using the next event simulation strategy 
and is run for 5000 transactions. Different random seeds are used for different calls 
to the random number generator to make sure that the arrived transactions are 
exactly the same for different algorithms. We simulate a firm real-time system in 
which a value returned after a deadline is useless. Hence, at the beginning of each 
event, the system is checked to see if there is any transaction that has missed its 
deadline and if so, it is removed from the system. We perform a detailed 
simulation study using our proposed algorithm and compare it with an existing 
non-secure one. We discuss the effectiveness of our algorithms in terms of 
maintaining real-time constraints and security. 
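The firm-deadline check performed at the beginning of each event can be sketched as below. The representation of transactions as (id, deadline) pairs, the function name, and the choice to count a deadline equal to the current time as missed are assumptions.

```python
def purge_late(transactions, now):
    """Split (txn_id, deadline) pairs into live transactions and missed ones."""
    live = [t for t in transactions if t[1] > now]
    missed = [t for t in transactions if t[1] <= now]  # removed from the system
    return live, missed
```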










4.3 Simulation results

Figure 1 illustrates the deadline miss percentage for the non-secure 2PLHP and the
Secure 2PLHP with a tolerance of 0 as the arrival rate increases. Non-secure
2PLHP is priority cognizant and hence performs better than the Secure
2PLHP algorithm. The non-secure algorithm does not miss any deadlines
at arrival rates below 20. Under the secure algorithm, transactions start to miss
deadlines around an arrival rate of 16. The difference between the
performance of the two algorithms is most prominent between arrival rates 15 and 25,
after which a majority of the transactions start to miss their deadlines under both
algorithms.




Figure 1 Deadline miss percentage. 

The Secure 2PLHP algorithm recognizes covert channels and, therefore, the
security factor of a system is improved when the algorithm is used. This result is
illustrated in Figure 2, which compares the security factors for the secure and
non-secure algorithms. In the secure algorithm, the security factor is 1; in contrast,
in the non-secure algorithm, it fluctuates and remains around 0.5.

Figure 2 Security factors.

It is interesting to notice in Figure 1 that although we enforce security with the Secure
2PLHP, we do not necessarily have to sacrifice the maintenance of the real-time
constraints a great deal. This is because there are many data conflicts for which
security enforcement does not violate priority based on deadline. Increasing
security means more data conflicts resolved in favor of lower security transactions,
irrespective of their deadlines. If a lower security transaction has a later deadline,
enforcing security means loss of priority. However, if it happens to have an earlier
deadline, then priority is maintained as well. Therefore, achieving complete
security (i.e. security factor 1) does not mean that we fail to meet all real-time
constraints. Figure 3 supports this claim. It shows the priority maintenance factor
at different arrival rates when the security factor is 1. In this case the priority
maintenance factor is not zero but instead varies between 0.2 and 0.6. That is the
reason why the real-time performance of the Secure 2PLHP in Figure 1 is close to
that of the non-secure 2PLHP. This is a very important feature of our algorithm.




Figure 3 Priority maintenance factor.

As the arrival rate increases, the number of data conflicts also increases. When 
tolerance = 0, data conflicts are always resolved in favor of security, and therefore, 
the number of data conflicts resolved in favor of priority does not increase at the 
same rate as the number of total data conflicts. As illustrated in Figure 3, the 
priority maintenance factor decreases as the arrival rate increases. However,
when the arrival rate increases beyond 16-17, the system starts to miss deadlines
and transactions are removed from the system as soon as they are late. As a result,
the number of data conflicts decreases with any subsequent increase in arrival rate.
However, because of the increased arrival rate, the number of conflicts resolved in
favor of priority still increases, which in turn increases the priority maintenance
factor. Therefore, once the system starts to miss deadlines, the priority
maintenance factor increases.
Figure 4 shows how security tolerance affects the security factor and the priority
maintenance factor. The number of access levels in our study is 6 and therefore,
the minimum possible CCF is 0.2 (section 2.2). As a result, when the tolerance is
0, the value of CCF in any security conflict is higher than the tolerance. In this
situation our algorithm resolves a conflict in favor of security. Therefore the
security factor stays at 1 until the tolerance increases to larger than 0.2.

Figure 4 Variation of security and priority factors with tolerance.

As indicated earlier in section 3.3, the smaller the tolerance, the higher the number of
times conflicts are resolved in favor of security and vice versa. Therefore, the 
security factor decreases with subsequent increase of tolerance values. As the 
security factor decreases, more and more conflicts are resolved in favor of priority, 
and as a result, the priority factor increases. The maximum value of CCF is 1. 
Therefore when the tolerance is greater than 1, any CCF will be smaller than the 
tolerance in which case every conflict is resolved in favor of priority. Thus the 
priority maintenance factor stays at 1 for tolerance greater than 1.
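The tolerance rule discussed above can be illustrated with a minimal Python sketch. It is not the authors' simulator. The CCF formula used here (level distance divided by the number of levels minus one) is an assumption chosen only because it reproduces the stated minimum of 0.2 and maximum of 1 for 6 access levels; the actual definition is given in section 2.2. The names `Txn`, `ccf`, `winner`, `factors` and `SAMPLE_CONFLICTS` are hypothetical, and the two factors are assumed to be simple fractions of conflicts.

```python
from dataclasses import dataclass

NUM_LEVELS = 6  # number of access levels used in the study


@dataclass(frozen=True)
class Txn:
    level: int       # security level, 0 = lowest (hypothetical encoding)
    deadline: float  # earlier deadline = higher priority


def ccf(a: Txn, b: Txn, num_levels: int = NUM_LEVELS) -> float:
    """ASSUMED covert channel factor: level distance scaled into (0, 1]."""
    return abs(a.level - b.level) / (num_levels - 1)


def winner(a: Txn, b: Txn, tolerance: float) -> Txn:
    """Return the transaction in whose favor the conflict is resolved."""
    if ccf(a, b) >= tolerance:
        # CCF not below the tolerance: favor security (lower level wins).
        return a if a.level < b.level else b
    # CCF below the tolerance: favor priority (earlier deadline wins).
    return a if a.deadline <= b.deadline else b


def factors(conflicts, tolerance):
    """Return (security factor, priority maintenance factor), both ASSUMED
    to be the fraction of conflicts in which that property was preserved."""
    secure = sum(1 for a, b in conflicts
                 if winner(a, b, tolerance).level == min(a.level, b.level))
    prio = sum(1 for a, b in conflicts
               if winner(a, b, tolerance).deadline == min(a.deadline, b.deadline))
    return secure / len(conflicts), prio / len(conflicts)


# Made-up conflicts between transactions at different security levels.
SAMPLE_CONFLICTS = [
    (Txn(0, 10.0), Txn(1, 5.0)),
    (Txn(0, 3.0), Txn(5, 8.0)),
    (Txn(2, 9.0), Txn(4, 4.0)),
    (Txn(3, 2.0), Txn(1, 7.0)),
]

for tol in (0.0, 0.3, 1.5):
    print(tol, factors(SAMPLE_CONFLICTS, tol))
```

On this made-up sample, a tolerance of 0 gives a security factor of 1 while the priority maintenance factor stays above zero, because one conflict happens to satisfy both criteria. A tolerance above 1 gives a priority maintenance factor of 1.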

Figure 5 shows a comparison of the restart ratios between the non-secure and the 
Secure 2PLHP algorithms. If the arrival rate is low, there are fewer conflicts and 
consequently fewer restarts. However as the transactions start to miss their 
deadlines, some of the aborted transactions may be already late and hence may not 
even restart because they are removed from the system. With an increased arrival 
rate, the number of late aborted transactions increases, which in turn decreases the 
number of restarts. Thus, the restart ratio increases with the increase in arrival rate 
until the system starts missing deadlines, after which the restart ratio begins 
decreasing and it continues to decrease with the subsequent increase in arrival rate. 
Figure 5 also illustrates that the restart ratio changes with the change of tolerance. 
This is explained by cases 2 and 3 of the Secure 2PLHP algorithm described in 
section 3.2 where tolerance is used to resolve the conflict. In case 3, no matter 
what the value of tolerance is, both the options abort transactions and therefore, do 
not change the restart ratio. Only in case 2 are there both block and abort options. 
If the value of tolerance is large, the CCF is more likely to be smaller than the 
tolerance, in which case the requester will be blocked yielding fewer restarts. If 
tolerance is very large (>1), the number of restarts is usually small compared to low
tolerance cases, and therefore the change in restart ratio is also not as sharp as in
other cases.
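The effect of firm deadlines on restarts, as described above, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `Txn`, `on_abort` and `restart_ratio` are hypothetical, a transaction is treated as late once the clock passes its deadline, and the restart ratio is assumed to be the number of restarts per submitted transaction, which may differ from the metric used in the study.

```python
from dataclasses import dataclass


@dataclass
class Txn:
    name: str
    deadline: float  # firm deadline: a late transaction has no value
    restarts: int = 0


def on_abort(txn: Txn, now: float) -> bool:
    """Return True if the aborted transaction restarts, False if it is dropped."""
    if now > txn.deadline:
        # Already late: removed from the system, so no restart is counted.
        return False
    txn.restarts += 1
    return True


def restart_ratio(txns) -> float:
    """ASSUMED metric: total restarts divided by submitted transactions."""
    return sum(t.restarts for t in txns) / len(txns)


t1 = Txn("t1", deadline=10.0)
t2 = Txn("t2", deadline=5.0)
on_abort(t1, now=4.0)  # still on time: restarts
on_abort(t2, now=6.0)  # already late: dropped
print(restart_ratio([t1, t2]))
```

In this sketch, the more aborted transactions are already late, the fewer restarts are counted, which matches the decline in restart ratio once the system starts missing deadlines.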




Figure 5 Restart ratio. 

5 CONCLUSION 

In this paper we proposed a new secure 2PL concurrency control algorithm for real-time
databases. The algorithm can use security as a correctness criterion where security
must be enforced. Alternatively, security can be treated as a criterion to be traded off
against real-time constraints, so that security is sacrificed to maintain more real-time
constraints. We have implemented the secure algorithm and a non-secure algorithm
and studied their performance using a firm real-time database system simulation
model. We have also introduced metrics to measure security in real-time database
systems. The results show that our algorithm performs fairly well in terms of
maintaining real-time constraints and security compared to the non-secure
algorithm. We have also shown that achieving security does not necessarily mean a
great deal of sacrifice in maintaining real-time constraints. A system can be made
100% covert channel free, but can still have a low deadline miss percentage for an
arrival rate as high as 20. In the future we will examine new measures for temporal
consistency, design suitable concurrency control algorithms and study their
performance.
Abstract

Among information system stakeholders, there are a variety of questions about the 
meaning of assurance (as the term pertains to information security), the means by 
which assurance is obtained, the means by which degrees of assurance can be 
differentiated, and the determination of a suitable level of investment specifically 
for building assurance. This paper identifies differences among stakeholders’ 
perceptions, which contribute to current assurance debates, and it proposes a 
model to help clarify assurance expectations in system acquisition, operation, and 
maintenance. 
1 INTRODUCTION

The purpose of this paper is to help information systems stakeholders to 
understand ‘assurance.’ Security concerns include not only the protection of 
assets, but also the assurance that protection provides. Assurance is a broad 
concept which is concerned with such questions as: 

• Should you believe that your information system will adequately protect your
data?

• Should you believe that your information system does more good than harm?

Consumers have difficulty answering these questions because the evidence 
available is undifferentiated and often complex. At times this evidence is
unbalanced, emphasizing security in one component while ignoring other
components, or supporting one policy but not another. At other times, the
evidence is understandable only to a specialist.

To understand your system’s ability to protect your data it is necessary not only
to gather evidence towards an assurance claim but also to understand how that
evidence contributes to your assurance. We discuss assurance by:

• Identifying some of the system stakeholders and their security concerns 

• Providing a framework for assessing assurance 

• Giving an understanding of the different definitions that are being used for 
assurance 

2 THE STAKEHOLDERS 

The effect of information system misuse on direct users is often stressed, but for a 
given system there may be many groups with a stake in system security. The 
various stakeholders may face different tangible or intangible risks, and thus they 
may have competing interests or goals. These differences and their relationship to 
assurance need to be taken into account, and priorities need to be established. 

To illustrate, Figure 1 identifies the stakeholders concerned with cellular
telephone systems in the United States. 

Figure 1 Cellular Phone Stakeholders.



2.1 Service Providers 

The most apparent stakeholder organizations are those that operate cellular 
telephone systems. In the industry jargon, they are the service providers. The 
management of a service provider is a stakeholder. Top management is concerned 
with business objectives, such as profitability of the enterprise. We may assume 
that the CEO is concerned with security assurance primarily as it affects business 
objectives. Enterprise policy and major decisions must be made by, or at least 
approved by, the CEO. 

Another class of stakeholders among the service providers is the system 
administrators. These personnel are responsible for operating many different 
technical systems within the enterprise. Some systems are directly involved in 
delivering service to consumers; others provide internal functions such as 
inventory and billing. The system administrators are probably all concerned with 
availability. To some personnel, integrity is more important than availability. 

Additional stakeholders are those who interface between customers and various 
systems. We will use the term help desk for this group. In many organizations the 
help desk staff is strongly encouraged to handle each consumer as rapidly as 
possible, so their major concerns include availability, ease of use, and response 
time. 

2.2 Advocacy Groups 

Customers may be categorized according to the service used, such as traditional 
fixed location telephone, cellular phone, or data transmission, including fax. 
Customers are typically concerned with cost and quality of service. They are also 
concerned with privacy and correctness of billing records. Theft of service is 
always a concern to the management of the service provider. It becomes a 
customer concern when the cloning of a mobile telephone number and equipment
serial number results in the customer being billed for services s/he did not use. If
the customer concern is expressed as an inclination to change service providers or
to use a conventional telephone instead of a cellular phone, then management
becomes concerned. The concerns of the various stakeholders are interrelated.

Several public interest stakeholders exist. The American Civil Liberties Union
and the Electronic Frontier Foundation lobby or litigate in support of their view
of the public interest and the U.S. Bill of Rights. The direct effect of these
stakeholders is felt by the service provider, which may be forced to change service 
offerings. The indirect effects of successful public advocacy are often difficult to 
predict. 

2.3 Vendors 

Two principal classes of vendor are the cellular phone manufacturers and the 
makers of phone switches and other infrastructure equipment. The cellular phone
manufacturers are interested in the end customers, primarily from the standpoint of
maintaining market share. 

The manufacturers of the hardware and software for the industry’s 
infrastructure are interested in satisfying the service providers, who are their 
direct customers. There are direct effects on profitability and indirect effects on 
reputation, market share, and legal liability. 

2.4 Government 

Law enforcement agencies may have several objectives. They have an
interest in insisting on the technical capability to conduct legal wiretaps. They
have been able to get this interest incorporated into legislation: the U.S.
Communications Assistance for Law Enforcement Act of 1994 affects
manufacturers by mandating technical features and assurances in their products.
The same act also impacts the service providers by requiring them to provide
wiretap service.

Through export controls and pressure, certain government stakeholders have 
convinced the manufacturers and service providers to use weak cryptographic 
techniques. There is a concern among the service providers that weak 
cryptographic techniques will not prove sufficient to reduce theft of service. The 
debate has not been resolved. 
3 ASSURANCE FRAMEWORK

An assurance framework providing a hierarchical structuring of the assurance 
argument has been developed (Williams, 1995). Many of the principles of that 
approach are relevant to understanding assurance and how it may apply to 
understanding the confidence you have in your system. 

3.1 Assurance Types 

It is important to understand what type of assurance is being reasoned about. The 
following assurance types define the applicability of the assurance towards the 
claim that a system may be used to protect data. Each type of assurance makes 
different claims about the system: 

• Correctness: Correctness refers to claims that the implementation is a
necessary and sufficient representation of the specification.

• Effectiveness: Effectiveness refers to claims that the selected security
functions are suitable for countering the identified threats.

• Usability: Usability refers to the ease of configuring and using the security
functions without compromising system security.

• Workmanship: Workmanship refers to product or system quality relative to
the state of the art, including maintainability, expandability, and durability.

3.2 Assurance Subjects 

It is equally important to understand the subject of the assurance claim. The 
following assurance subjects introduce the elements required to develop, evaluate, 
and operate a product or system. Each assurance subject contributes (in a different 
way) to the overall assurance that the system will protect the data with which it 
has been entrusted. 

• Technology: Technology evidence comes from examining a product or
system and its security mechanisms directly. Examples of technology
evidence include system architectures, models, test results, evaluation results,
and configuration parameters.

• Process: Process evidence comes from examining whether the development,
evaluation, and operation processes are trustworthy and have been followed.
Examples of process evidence include defined plans and procedures, process
metrics, and performance data.

• Personnel: Personnel evidence comes from examining the individuals and
organizations in the roles of developers, evaluators, and operators. Examples
of personnel evidence include credentials, background checks, hiring
guidelines, experience data, and training data.

• Environment: Environment evidence documents reasons that development,
evaluation, and operation environments are trustworthy. Environments
should be considered to include tools and facilities. Examples of environment
evidence include physical protections, tool capabilities, and backup
mechanisms.

3.3 Assurance Framework 

The framework illustrated in Table 1 provides a structure for mapping assurance 
subjects to each of the assurance types. Examples of assurance evidence for each 
category are provided, but there are many more. This framework is intended to 
apply equally well to products, systems, mechanisms, components, or any other 
assurance subjects. 

Note that there is no indication of which assurance source is best. While one 
customer may rely heavily on assurance about the correctness of the system gained 
from analyzing the technology being deployed, others may want assurance about 
the effectiveness of the system gained from assessing the people who built it, or 
assurance about the usability of the system gained from reviewing the 
environment in which it was built. 
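One way to picture the framework is as a grid with one cell per pairing of assurance type and assurance subject, each cell holding the evidence a stakeholder has collected. The Python sketch below is a hypothetical illustration, not part of the framework in (Williams, 1995). The class and method names are assumptions, and placing "test results" in the correctness/technology cell is only an example.

```python
from collections import defaultdict
from enum import Enum


class AssuranceType(Enum):
    CORRECTNESS = "correctness"
    EFFECTIVENESS = "effectiveness"
    USABILITY = "usability"
    WORKMANSHIP = "workmanship"


class AssuranceSubject(Enum):
    TECHNOLOGY = "technology"
    PROCESS = "process"
    PERSONNEL = "personnel"
    ENVIRONMENT = "environment"


class AssuranceFramework:
    """Grid of evidence items indexed by (assurance type, assurance subject)."""

    def __init__(self):
        self._cells = defaultdict(list)

    def add_evidence(self, a_type, subject, item):
        """Record one piece of evidence in the given cell."""
        self._cells[(a_type, subject)].append(item)

    def evidence(self, a_type, subject):
        """Return a copy of the evidence recorded for one cell."""
        return list(self._cells.get((a_type, subject), []))

    def uncovered(self):
        """List the cells that have no evidence yet, in row-major order."""
        return [(t, s) for t in AssuranceType for s in AssuranceSubject
                if not self._cells.get((t, s))]


framework = AssuranceFramework()
# Hypothetical placement: test results used as correctness evidence.
framework.add_evidence(AssuranceType.CORRECTNESS,
                       AssuranceSubject.TECHNOLOGY, "test results")
print(len(framework.uncovered()))
```

The sketch deliberately assigns no weights to the cells, reflecting the point that the framework does not say which assurance source is best. Each stakeholder decides which cells matter.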
Table 1 Assurance Framework

4 BARRIERS TO ASSURANCE

It has been difficult to obtain a sound basis for confidence in the security 
effectiveness of our information systems. Two of the primary reasons are 
discussed below. 

4.1 Lack of Understanding 

This is listed first because it is arguably the fundamental problem. 
4.1.1 Too Many Definitions for Assurance

Webster’s dictionary says: 

Assurance - 1. the act of assuring. 2. the state of being assured; sureness;
confidence; certainty. 3. something said or done to inspire confidence, as a
promise, positive statement, etc.; guarantee

According to Webster, assurance can be both something done to inspire
confidence and the state of being confident. The information security community
uses assurance in both of these ways and even more:

1. The confidence that the information system is effective in meeting its security 
objectives. In this usage assurance is a measure of how sure one is that the 
system will do what it is supposed to do and not do what it is not supposed to 
do. 

2. The above usage plus confidence that the objectives themselves are correct. 
Even when using assurance as confidence, the extent of what one is confident 
about varies, but the same word, assurance, is used. 

3. A specific type of measure that provides a basis for having confidence. This is 
distinctly different from the subjective nature of 1 and 2 above. Here assurance 
is an objective measurement related to the information system not a measure of 
confidence. 

When used in this manner, assurance relates to a specific type of measurement. 
While the individual probably understands that other measures are possible, in 
practical terms assurance frequently becomes narrowly defined. 

4. Collection of measures of or facts about a system that provide a basis for 
having confidence. This is very similar to 3 above, but includes multiple types 
of measurement and fact in the practical, working definition. 

5. The inherent security quality of the information system. The term assurance is 
used, not as confidence or a metric, but as a statement of a system 
characteristic. This is distinctly different from how one feels about the system 
(confidence) and from measurements of the system. 

In summary, assurance is being used to mean: 

• A subjective measure of human confidence. 

• An objective measurement of or fact about an information system. 

• A system characteristic which exists independent of confidence in the system
or any measurement of, or fact about, the system.

Information system stakeholders are advised to explicitly specify which 
definitions they are using when they address assurance requirements. 

4.1.2 Assurance: A system requirement

Information security is frequently seen, not as a system requirement, but as a 
hindrance to be minimized (or avoided if possible). The information systems 
environment has radically changed from stand-alone and supporting to 
interconnected and integrated. This change greatly increases the potential for our 
information systems to cause us harm. Diligent management of information-related
risks is not a hindrance, but an essential element in the use of information
technology.

4.2 Organizational Pressures 

Organizational pressures may form barriers to information security; for example: 

• Those who know do not decide 

• Those who decide do not suffer 

• Those who should care are engaged elsewhere 

4.2.1 Those who know don’t decide

In many organizations those who have the technical experience to understand 
information systems security (at least in the manner in which it has been relayed 
to date) are not those who make the buying decisions for the organization. Those 
who have, by virtue of training and experience, the capability to understand 
computing security issues are removed from the process owners who make the 
trade-off decisions between functionality and dependability. 

4.2.2 Those who decide don’t suffer

Decisions about which information technology product to install or how to 
automate a specific process are frequently made by individuals who do not feel the 
security impact of these choices. The support group is often quite removed from 
direct knowledge of or concern about the specifics of the business or mission 
process. These individuals simply do not concern themselves with business 
impacts. Instead their concern is directly related to the technology. They care 
deeply if the system goes off line, but not because they understand the business 
impact. 
4.2.3 Those who should care are engaged elsewhere

The essence of risk management is to achieve a cost-effective tradeoff between
business/mission risks and security countermeasures. Typically, the only
individuals who can decide whether a given countermeasure is cost-effective are
the owners of the process being automated. In many organizations these
individuals are fully engaged in running the business or in accomplishing the 
mission. Like the information systems organization, the process owners have 
come to see information security in terms of machines and files. As such, these 
process owners see information security as a technical detail on which they cannot 
afford to spend their time. 



5 SUMMARY 

This paper has focused on introducing assurance issues. We have provided a 
summary of the concerns of various stakeholders, a framework for assurance, and 
a description of some of the common barriers to information security. 

The list of references contains sources of additional information on assurance. 
General Information





IFIP TC-11 



The International Federation for Information Processing (IFIP) is a multinational 
federation of professional and technical organizations (or national groupings of 
such organizations) concerned with information processing. IFIP consists of 
approximately 50 member organizations, representing some 60 countries. Eleven 
societies, associations, federations or councils are affiliate members to IFIP. 

IFIP was founded under the auspices of UNESCO. Its official relationship with 
that organization is classified as category B, that is, able to advise in a particular
field. IFIP established official relations with the World Health Organization in 
February 1972 and maintains informal relationships with other members of the UN 
family. IFIP has the status of a Scientific Affiliate of the International Council of 
Scientific Unions (ICSU). 

In 1970, IFIP together with four sister federations, IFAC, IFORS, IMACS and 
IMEKO, established a Five International Associations Co-ordinating Committee 
(FIACC) which provides a basis for the cordial and successful co-ordination of a 
variety of activities of mutual interest. IFIP also participates in an advisory capacity 
in the work of CCITT, the International Telegraph and Telephone Consultative 
Committee. 

IFIP Technical Committee 11 on Security and Protection in Information Systems
was created in 1983 under the chairmanship of the late Kristian Beckman of 
Sweden. Representatives from 28 countries that are members of this committee 
meet at least once a year at the IFIP SEC conferences that are held in different 
member countries. 



IFIP TC-11 Aim and Scope 

The aim of TC-11 is to increase the reliability and general confidence in 
information processing as well as to act as a forum for security managers and others 
professionally active in the field of information processing security. 

The scope of TC-11’s activities will include:

• the establishment of a common frame of reference for security in organizations, 
professions and the public domain; 

• the exchange of practical experience in security work; 

• the dissemination of information on and the evaluation of current and future
protective techniques;

• the promotion of security and protection as essential elements of information 
processing systems. 

In order to accomplish its objectives, TC-11 has established a number of working 
groups (WGs) to address specific areas of security interest. Special task forces
(TFs) are set up when a topical subject requires a reaction or standpoint from
TC-11. 



IFIP TC-11 general information 



Chairman: prof.dr. Sebastiaan von Solms, Rand Afrikaans University,
Johannesburg, South Africa

Vice-chairman: prof.dr. Reinhard Posch, Technical University Graz, Austria

Secretary: mr. John Beatson, Wellington, New Zealand



The official journal of TC-11 is Computers & Security, published by Elsevier 
Advanced Technology. ISSN 0167-4048 



More information 

All current information about IFIP can be found on or via IFIP’s homepage: 
http://www.ifip.or.at 

All current information about TC-11 can be found on or via TC-11’s homepage:
http://www.ifip.tu-graz.ac.at/TC11

The IFIP secretariat can be reached at: 

IFIP TC-11 working groups



WG 11.1 INFORMATION SECURITY MANAGEMENT 

Chair: prof.dr. Rossouw von Solms, Port Elizabeth Technikon, South Africa

Aim 

As management, at any level, may be increasingly held answerable for the reliable 
and secure operation of the information systems and services in their respective 
organizations in the same manner as they are for financial aspects of the enterprise, 
the Working Group will promote all aspects related to the management of 
information security. 

These aspects cover the wide range, from the pure managerial aspects concerning 
information security, like upper management awareness and responsibility for 
establishing and maintaining the necessary policy documents, to more technical 
aspects like risk analysis, disaster recovery and other technical tools to support the 
information security management process. 

Scope 

The scope of the working group shall be to: 

• study and promote methods to make senior business management aware of the 
value of information as a corporate asset, to realise the risks involved with this 
corporate asset, and to get their commitment to implementing and maintaining 
the necessary objectives and policies to protect these assets; 

• study and promote methods and ways to measure and assess the security level in 
a company and to convey these measures and assessments to management in an 
understandable way; 

• research and develop new ways to identify the information security threats and 
vulnerabilities which every organization must face; 

• research and identify the effect of new and changed facilities and functions in 
new hardware and software on the management of information security;

• study and develop means and ways to help information security managers to 
assess their effectiveness and degree of control; 

• address the problem of standards for information security. 

WG 11.2 SMALL SYSTEMS SECURITY

Chair: prof.dr. Jan Eloff, Rand Afrikaans University, Johannesburg, South Africa 

Aim 

To investigate methods and issues in the area of information security, particularly
related to small systems; and to advance knowledge and awareness of the subject
through publications, conferences and other means. The aim is to address small
systems security from both a functional and technical perspective.

Scope 

The scope of the working group shall be to: 

• promote the design of the new information security techniques and methods in 
systems where the functionality and responsibility for secure systems are 
distributed to the end user; 

• investigate and report on the information security aspects of information 
technology products and information services for end users and consumers; 

• address the information security aspects in systems which could technically be 
described within the range from bits-like systems such as intelligent tokens up 
to desktop type workstations; 

• design guidelines and promote methodologies for the implementation of 
information security in small organizations; 

• investigate intelligent token and smart card applications in information security 
with the aim of making the user less dependent on a shared environment. 



WG 11.3 DATABASE SECURITY

Chair: prof.dr. John Dobson, University of Newcastle, Newcastle upon Tyne, UK 

Aim and Scope 

The aim and scope of the working group shall be: 

• to advance technologies that support: 

• the statement of security requirements for database systems; 

• the design, implementation, and operation of database systems that 
include security functions; 

• the assurance that implemented database systems meet their security 
requirements; 

• to promote wider understanding of the risks to society of operating database 
systems that lack adequate measures for security or privacy; 

• to encourage the application of existing technology for enhancing the security 
of database systems; 

WG 11.4 NETWORK SECURITY

Chair: prof. dr. Sokratis Katsikas, University of the Aegean, Samos, Greece 

Aim 

• to study and promote internationally accepted processes which will enable
management and technicians to fully understand their responsibility in respect of
the reliable and secure operation of the information networks which support their
organizations, their customers or the general public;

• to study and promote education and training in the application of security
principles, methods, and technologies to networking.

Scope

• to promote the awareness and understanding of the network aspect of information
systems security;

• to provide a forum for the discussion, understanding and illumination of network
security matters;

• to study and identify the managerial, procedural and technical aspects of network
security; and hence to define the network security issues;

• to study and describe the risks that arise from embedding an information system in
a network environment;

• to advance technologies and practices that support network security controls, make
possible the statement of requirements for network security, and in general,
advance the foundation for effective network security;

• to contribute, as feasible and appropriate, to international standards for network
security.



WG 11.5 SYSTEMS INTEGRITY AND CONTROL 

Chair: Leon Strous, De Nederlandsche Bank, Amsterdam, The Netherlands 

Aim 

To promote awareness of the need to ensure proper standards of integrity and 
control in information systems in order to ensure that data, software and, ultimately, 
the business processes are complete, adequate and valid for intended functionality 
and expectations of the owner (i.e. the user organisation). 

Scope 

• study and promote the research and use of standard mechanisms / measures to 
ensure that data integrity requirements in information systems and their use in 
business are satisfied; 

• study and promote the use of standard evaluation criteria to define the integrity 
and control requirements; 

• study and promote the use of advanced tools and techniques as a means to
identify integrity and control weaknesses;

• study and promote the use of advanced tools and techniques to support the work 
of internal and external auditors; 

• promote the mutual understanding of the edp-audit, security and development 
functions between personnel engaged in those functions and to the wider 
business community. 



WG 11.8 INFORMATION SECURITY EDUCATION 

Chair: prof.dr. Louise Yngström, University of Stockholm, Sweden

Aim 

To promote information security education and training at the university level and
in government and industry.

Scope 

The scope of the working group shall be to: 

• establish an international resource center for the exchange of information about 
education and training in information security;

• develop model courses in information security at the university level; 

• encourage colleges and universities to include a suitable model course in 
information security at the graduate and/or undergraduate level in the 
disciplines of computer science, information systems and public service; 

• develop information security modules that can be integrated into a business 
educational training programme and/or introductory computer courses at the 
college or university level; 

• promote an appropriate module about information security to colleges and 
universities, industry and government; 

• collect, exchange and disseminate information relating to information security 
courses conducted by private organizations for industry; 

• collect and periodically disseminate an annotated bibliography of information 
security books, feature articles, reports, and other educational media. 